Searching for information is one of the basic components of human activity, and we encounter it every day: studying a theater playbill to choose an interesting performance, picking a convenient train from a timetable, leafing through a phone book. Anyone who, by profession or hobby, regularly selects and searches for information on some topic sooner or later (as its volume grows) has to apply some principles of systematization and classification to the available data in order to make the search more convenient and effective. Libraries, for example, compile a card catalog: information about a book is recorded on a card in a prescribed way, together with a code, a few letters and digits from which the location of the book (storage room, bookcase, shelf) can be determined; the cards are arranged alphabetically or by subject. Computers offer far greater possibilities for working with large amounts of information.

4.1. Key Definitions

Information retrieval system (IPS) - a software system for storing, searching for and issuing information of interest to the user (subscriber). The subscriber addresses the IPS with an information request: a text reflecting the information need of this subscriber, for example a wish to find a list of books on the theory of information retrieval or a list of pharmacies where the right medicine can be bought. Information is searched for in the search array, which is formed (and updated as necessary) by the developers or administrators of the system. Elements of the search array are entered into the IPS in a natural language (or one close to it) and are then usually subjected to indexing, i.e. formal translation into the information retrieval language.

Indexing - the expression of the central theme or subject of a text, or of the description of an object, in the information retrieval language.

Subject - an object (a material thing, concept, property or relation) that is considered or mentioned in the document or information request.

Theme of a document or information request - the branch of science or technology, field of practical activity or problem to which the document or information request is devoted.

By the nature of the search array and of the information issued, IPSs are divided into documentary and factographic.

Documentary IPS - intended for finding documents (articles, books, reports, descriptions of inventor's certificates and patents) that contain the necessary information. The search array of such an IPS consists of search images of documents (i.e. elements each of which conveys the main content of a document) or of the documents themselves. In response to an information request, the IPS issues a set of documents (or their storage addresses) containing the required information. A document is any meaningful text that has a certain logical completeness and contains information about its source and/or creator.

Factographic IPS - issues directly the factual information that the consumer asked for in the information request. Its search array consists of factographic records, i.e. descriptions of facts extracted from documents and presented in some formal language.

For example, if the Dating Service decided to create a documentary IPS, the search array would consist directly of its clients' letters, such as: "My name is Ilya Muromets. I sat on the stove for 33 years, and now I serve in the king's guard ...". To create a factographic IPS, tables of the following form would be filled out from the clients' letters: "Surname - Muromets. Name - Ilya. Age - 33. Position - guard". Accordingly, the request in the first case would be the part of the client's letter stating his wishes about a partner: "I want a bride who is younger than me, wise, and interested in housekeeping", and in the second a table compiled from it: "Age - under 33, intelligence - high, interests - housekeeping".

At present, factographic IPSs (as a special class of search systems) are hardly being developed; the functions they perform are implemented with ordinary DBMSs. From here on, "IPS" will mean a documentary information retrieval system.

One of the most popular ways of translating a document into the internal language of the system is coordinate indexing: assigning to the document a set of keywords or codes that define its content. Two indexing methods are possible: free, when keywords are extracted directly from the text of the document without regard to the variety of their forms or the relations between them; and controlled, when the search image of the document includes only words that are recorded in the information retrieval thesaurus, where their synonymous, morphological and associative relations are indicated.
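The difference between free and controlled indexing can be sketched in a few lines of Python. This is only an illustration: the tiny thesaurus, the stop-word list and the letter text are assumptions made up for the example, not part of any real system.

```python
import re

# Hypothetical mini-thesaurus: natural-language keyword -> descriptor.
THESAURUS = {
    "wise": "WISE",
    "smart": "WISE",
    "stove": "STOVE",
    "guard": "GUARD",
    "guards": "GUARD",
}

# Hypothetical stop words that carry no search value.
STOP_WORDS = {"i", "on", "the", "for", "and", "now", "as", "a"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def free_index(text: str) -> set[str]:
    """Free indexing: keep every non-stop word exactly as it occurs."""
    return {word for word in tokenize(text) if word not in STOP_WORDS}


def controlled_index(text: str) -> set[str]:
    """Controlled indexing: keep only words recorded in the thesaurus,
    replaced by their descriptors."""
    return {THESAURUS[word] for word in tokenize(text) if word in THESAURUS}


letter = "I sat on the stove for 33 years, and now I serve as a guard for the king"
print(sorted(free_index(letter)))
print(sorted(controlled_index(letter)))
```

Free indexing keeps words such as "sat" and "years" that say little about the content, while controlled indexing reduces the letter to the descriptors the thesaurus knows about.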

4.2. Thesaurus

Thesaurus - a specially organized normative dictionary of the lexical units of the information retrieval language and of natural language. The lexical units of the information retrieval language are descriptors. A descriptor is put in one-to-one correspondence with a group of natural-language keywords selected from the texts of a certain subject area. For example, any keyword or phrase (preferably the most frequently used or the shortest), or a numeric code, can be chosen as the descriptor. A polysemous natural-language word corresponds to several descriptors, and several words and expressions with the same meaning correspond to one descriptor. The thesaurus records the semantic relations between words: antonymy, synonymy, hyponymy, hyperonymy, association.

Synonyms - words (phrases) that differ in spelling but have the same meaning (within the subject area): witch = evil sorceress. Antonyms - words with opposite meanings: kind - evil. Hyponym - a term denoting a special case of another, more general concept. Hyperonym - conversely, a term that is general with respect to a number of other, more specific concepts.

soldier = hyponym(serviceman); person = hyperonym(serviceman)

hyperonym(cooks deliciously) = hyperonym(keeps the house clean) = hyperonym(can sew) = good housewife.

The State Standard "Monolingual Information Retrieval Thesaurus" defines the following types of relations:

- genus - species: means of transport - cart, flying carpet, seven-league boots, stove;

- part - whole: wall, door, chicken leg - parts of the hut;

- cause - effect: the sword came down - the head came off the shoulders;

- raw material - product: steel - sword;

- administrative hierarchy: sultan - vizier - guard;

- process - subject: execute - executioner;

- process - object: execute - victim;

- functional similarity: Emelya's stove - Jeep Cherokee;

- property - property carrier: fire-breathing - the Dragon;

- antonymy;

- synonymy.

An associative relation is a union of other relations that are not included in hierarchical relations or in synonymy relations (that is, any kinds of relations between words, possibly very specific, existing only in a certain subject area).

A dictionary entry (at an informal level) could look like this:

WISE = smart

ANTONYM - stupid

HYPONYMS: knowledgeable, educated, clever, well-read

KIND OF - indicator of intelligence (high)
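A dictionary entry of this kind maps naturally onto a small record type. The sketch below is one possible representation, not a standard format: the class name `ThesaurusEntry`, its fields and the decision to map hyponyms onto the broader descriptor are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThesaurusEntry:
    """One informal thesaurus entry (hypothetical layout)."""
    descriptor: str
    synonyms: set[str] = field(default_factory=set)
    antonyms: set[str] = field(default_factory=set)
    hyponyms: set[str] = field(default_factory=set)
    kind_of: Optional[str] = None  # the more general concept, if any


def build_lookup(entries: list[ThesaurusEntry]) -> dict[str, str]:
    """Map each keyword to its descriptor.

    Assumption: hyponyms are indexed under the broader descriptor too,
    so a request for WISE also finds letters that say "well-read".
    Antonyms are deliberately not mapped.
    """
    lookup: dict[str, str] = {}
    for entry in entries:
        lookup[entry.descriptor.lower()] = entry.descriptor
        for word in entry.synonyms | entry.hyponyms:
            lookup[word] = entry.descriptor
    return lookup


wise = ThesaurusEntry(
    descriptor="WISE",
    synonyms={"smart"},
    antonyms={"stupid"},
    hyponyms={"knowledgeable", "educated", "clever", "well-read"},
    kind_of="indicator of intelligence (high)",
)

lookup = build_lookup([wise])
print(lookup["well-read"])
```

Whether hyponyms should collapse onto the broader descriptor is a design choice: it raises completeness at the cost of some accuracy.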

The thesaurus and the grammar together make up the information retrieval language. The grammar contains the rules for forming derived units of the language (semantic codes, syntagms, sentences) and governs the use of the means of denoting syntactic relations (for example, link indicators).

In the fairy-tale information service described above, the thesaurus must describe all the qualities and characteristics found in clients' letters and the rules for classifying them. The grammar and thesaurus must be composed so that the system can understand what a number in a request specifies: height, age or number of teeth (this can be determined from a keyword, the unit of measurement), and can distinguish the information a client gives about himself from his requirements for a partner (phrases such as "I would like to meet" or "must match" help here).

Based on the thesaurus and the grammar rules, the search images of the document and of the request (the search prescription) are formed. Search prescription - a text in the information retrieval language containing the characteristics of the documents that the user asks for in the request.

Document search image - a text in the information retrieval language that is put in one-to-one correspondence with the document and reflects those of its features that are needed to find it on request. In addition to search features that reveal the content of the document or at least determine its topic, the search image usually also contains identifying and some additional information (imprint, type of document, its language, etc.). Search prescriptions are generated when requests arrive, whereas search images of documents can be created either when the system is replenished with new documents or while an answer to a request is being sought. In systems where the flows of information are large and frequently updated, it makes little sense to spend resources on indexing, and the document itself or its title is often taken as the search image.

4.3. Relevance

The purpose of an IPS is to issue documents that are relevant to the request, i.e. that correspond to it in meaning. A distinction is made between content (substantive) relevance and formal relevance. Content relevance is the correspondence of a document to an information request as determined informally (Vasilisa the Wise herself reads the letters of all the fine young men and picks the suitors who meet her requirements); formal relevance is the correspondence determined algorithmically, by comparing the search prescription with the search image of the document on the basis of the delivery criterion adopted in the IPS.

Delivery criterion - a formal rule, a set of features by which the degree of formal relevance between the search image of a document and the search prescription is determined and the decision is made whether or not to issue the document in response to an information request.







[Figure: the two kinds of relevance and the array of documents]

In automated systems the search is based on formal relevance; content relevance is determined there by, for example, expert assessment and is used to obtain data on the effectiveness of information retrieval in the system (the quality of its work). The delivery criterion may be a complete coincidence of the search images of the document and the request, the inclusion of the set of request keywords in the set of document keywords, the intersection of these sets, and so on.

In the example under consideration, if complete coincidence of the keywords of the document and the request is chosen as the delivery criterion, the client must be given the letters of characters who meet his requirements in full. This is unlikely to satisfy clients, since the choice will obviously not be large. Such a criterion suits a system where accuracy is essential, for example one that selects a medicine for treating a particular disease (let there be few, but all suitable); here, however, the intersection criterion is probably appropriate.
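The three delivery criteria mentioned above (complete coincidence, inclusion, intersection) can be expressed directly with set operations. The following is a minimal sketch; the function names, the `min_common` parameter and the sample letters are assumptions introduced for the example.

```python
def exact_match(request: set[str], document: set[str]) -> bool:
    """Complete coincidence of the two keyword sets."""
    return request == document


def inclusion(request: set[str], document: set[str]) -> bool:
    """Every request keyword occurs among the document keywords."""
    return request <= document


def intersection(request: set[str], document: set[str], min_common: int = 1) -> bool:
    """At least `min_common` keywords are shared (hypothetical threshold)."""
    return len(request & document) >= min_common


def deliver(request, documents, criterion):
    """Return the names of the documents that satisfy the chosen criterion."""
    return sorted(name for name, keys in documents.items() if criterion(request, keys))


# Hypothetical search images of three clients' letters.
letters = {
    "Ivan": {"young", "kind", "brave"},
    "Ilya": {"strong", "brave"},
    "Koschei": {"rich", "powerful"},
}
request = {"young", "brave"}

print(deliver(request, letters, exact_match))
print(deliver(request, letters, inclusion))
print(deliver(request, letters, intersection))
```

With these sample data the strictest criterion issues nothing, inclusion issues one letter and intersection issues two, which matches the observation that a looser criterion gives the client more to choose from.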

Descriptors can be assigned weights according to their importance for the request; during the search, the weights of the descriptors found both in the request and in the document are summed, and documents are issued depending on the value of this sum (for example, if it exceeds a certain threshold). Thus, if wealth and power are marked as the most significant characteristics, rather than kindness and age, one may end up with Koschei the Immortal as a suitor. With weights, ranked delivery can also be used: the selected documents are presented to the user not in arbitrary order but by degree of relevance (in descending order of weight), and the final choice of relevant documents is left to the user.
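Weighted descriptors and ranked delivery might look as follows. This is a sketch under stated assumptions: the weights, the threshold and the sample search images are invented, and ties are broken alphabetically only to make the output deterministic.

```python
def score(request_weights: dict[str, int], document: set[str]) -> int:
    """Sum the weights of the request descriptors present in the document."""
    return sum(w for descriptor, w in request_weights.items() if descriptor in document)


def ranked_delivery(request_weights, documents, threshold):
    """Issue the documents whose score reaches the threshold, best first."""
    scored = [(score(request_weights, keys), name) for name, keys in documents.items()]
    selected = [(s, name) for s, name in scored if s >= threshold]
    # Descending score; the name is an arbitrary tie-breaker.
    selected.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in selected]


# Hypothetical request in which wealth and power matter most.
request_weights = {"wealth": 5, "power": 4, "kindness": 1, "young": 1}

suitors = {
    "Koschei": {"wealth", "power"},
    "Ivan": {"kindness", "young"},
    "Ilya": {"power", "kindness", "young"},
}

print(ranked_delivery(request_weights, suitors, threshold=5))
```

With these weights Koschei scores 9, Ilya 6 and Ivan 2, so Koschei heads the list, as the text predicts.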

An ideal IPS would issue the documents that are substantively relevant to the request and nothing else. In practice this is usually not achieved: the IPS exhibits silence (failure to issue some relevant documents) and noise (issuing extra documents). The array of documents is divided into issued and not issued by one criterion, and into relevant and irrelevant by another.

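Thus, for each request we get 4 groups of documents:

- Rv - relevant documents that were issued;

- Rn - relevant documents that were not issued;

- Nv - irrelevant documents that were issued;

- Nn - irrelevant documents that were not issued.

The ratio of the numbers of documents in these groups determines the effectiveness of information retrieval. The following characteristics are used to evaluate effectiveness:

$$\text{Completeness of issue} = \frac{\mathrm{Rv}}{\mathrm{Rv} + \mathrm{Rn}} \times 100\%$$

$$\text{Delivery accuracy} = \frac{\mathrm{Rv}}{\mathrm{Rv} + \mathrm{Nv}} \times 100\%$$

$$\text{Loss of information} = \frac{\mathrm{Rn}}{\mathrm{Rv} + \mathrm{Rn}} \times 100\%$$

$$\text{Information noise} = \frac{\mathrm{Nv}}{\mathrm{Rv} + \mathrm{Nv}} \times 100\%$$

$$\text{Sensitivity} = \frac{\mathrm{Rv}}{\mathrm{Rv} + \mathrm{Rn}}$$

$$\text{Specificity} = \frac{\mathrm{Nn}}{\mathrm{Nn} + \mathrm{Nv}}$$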

In an ideal IPS, Rn = Nv = 0, so completeness and accuracy are 100% and noise is 0 (every relevant document was found and not a single extra one). In real systems the completeness coefficient reaches 70%, while the accuracy coefficient varies over a very wide range, sometimes falling to 10%. The values of these coefficients depend on a number of factors: both the internal properties of the search system itself (the size and characteristics of the information array, the information retrieval language, the delivery criteria) and many "external" conditions: how specific the information requests are, the user's ability to formulate an information need correctly in natural language, the correctness with which a particular request is constructed, and the user's subjective idea of what information is actually needed. Because of the errors and inaccuracies that arise at every stage of the work of both user and system, the results can differ greatly from what the user hoped to get from the IPS.
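These characteristics are easy to compute once the relevant and the issued documents are known. The sketch below assumes the usual reading of the four groups: Rv (relevant, issued), Rn (relevant, not issued), Nv (irrelevant, issued) and Nn (irrelevant, not issued). The function name and the sample document numbers are made up for illustration.

```python
def retrieval_metrics(relevant: set, issued: set, collection: set) -> dict[str, float]:
    """Compute effectiveness characteristics, in percent.

    Assumes `relevant` and `issued` are subsets of `collection`.
    """
    rv = len(relevant & issued)                # relevant and issued
    rn = len(relevant - issued)                # relevant but not issued (silence)
    nv = len(issued - relevant)                # issued but irrelevant (noise)
    nn = len(collection - relevant - issued)   # irrelevant and not issued

    def percent(part: int, whole: int) -> float:
        # An empty denominator is reported as 0 to avoid division by zero.
        return 100.0 * part / whole if whole else 0.0

    return {
        "completeness": percent(rv, rv + rn),  # same ratio as sensitivity
        "accuracy": percent(rv, rv + nv),
        "loss": percent(rn, rv + rn),
        "noise": percent(nv, rv + nv),
        "specificity": percent(nn, nn + nv),
    }


# Hypothetical collection of ten numbered documents.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}
issued = {1, 2, 3, 5, 6}

metrics = retrieval_metrics(relevant, issued, collection)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

Here three of the four relevant documents were issued together with two extra ones, so completeness is 75% and accuracy is 60%; completeness and loss add up to 100%, as do accuracy and noise.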

There is also the concept of search stability: a characteristic of how completeness and accuracy change under small (semantically insignificant) changes to the request. The average values of completeness and accuracy for a particular system are usually computed by testing it on a reference document collection.

Different delivery criteria are chosen depending on the requirements for the quantity and quality of the information issued by the IPS. If it is important not to miss necessary information (patent examination), completeness must be increased; if the volume of information issued must be reduced (a library), accuracy must be improved.

The English scientist C. Cleverdon found an inverse relationship between the completeness and the accuracy of search within one system (when the same information retrieval language is used): an increase in completeness leads to an increase in noise, that is, to a decrease in accuracy, and conversely, as noise decreases, completeness decreases. Both indicators can be improved simultaneously only by changing the information retrieval language, making its grammar and thesaurus linguistically richer. Moreover, achieving the highest possible completeness involves enormous difficulties: the last 5-10% require as much complexity in the linguistic apparatus of the system as the preceding 90-95%, which increases the complexity of processing the input information and the search time.

4.4. Language component

More detailed processing of the document text does much to increase the effectiveness of an IPS. There are systems that, for simplicity, take the title of a document as its search image; for various reasons, however, the title does not always formally reflect the content of the text. For example, the article "An Eye Like an Eagle's", used in preparing this material, has nothing to do with ornithology or oculists. Programs that perform linguistically meaningful processing of natural-language texts (taking morphology and syntax into account) are of great importance. Only with their help can one establish whether similar words (almost all letters the same) are forms of one word or entirely different words corresponding to different semantic units.

More primitive, surface-level techniques can let the IPS developer down. If the system takes no account of the rules of the Russian language and works with templates (such as var*, text*.exe), then, when looking for a suitor for Cinderella who is fond of ballroom dancing, one has to choose the template keyword bal* (from the Russian "bal", ball) so as not to lose information: otherwise this characteristic could be missed when it is expressed by the words "I like to dance at balls". As a result of the search, she may then be offered acquaintance with every lover of ballet, balyk (cured fish), Balmont and Balzac, with everyone living around the Baltic Sea or in houses with a balcony, and with all sorts of spoiled darlings and darlings of fate.

All these candidates are eliminated if the adjective "ballroom" is given as the keyword and the system can recognize it in all its forms (morphological analysis of words also makes it possible to reduce the size of the thesaurus by ridding it of redundant information; otherwise all the forms of a word would have to be defined as synonyms). Another way to reduce noise and improve accuracy is to add handling of cognate words to the apparatus of the information retrieval language. In our example, if the root "bal" were given as the key, only documents containing various forms of the words "ball" and "ballroom" would be issued. In that case, however, the letter of the longed-for prince would be lost among messages about ball gown salons, owners of ballrooms, and the musicians and waiters who serve at balls. With syntactic analysis, phrases can be identified more precisely (for example, recognized not only when the words follow one another but also when they are separated by several other words). In the example above, a system with a syntactic component could search for documents containing the phrases "ballroom dance" and "to dance at a ball". Of course, even this does not give 100% accuracy (nothing prevents messages from ballroom dance teachers being issued, for instance), but the number of documents issued will clearly be much smaller, and Cinderella will no longer turn into an old maid while looking through the information the system offers her.
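The contrast between template matching and matching by word form can be tried out on transliterated Russian words. In this sketch the word list and the `LEMMAS` table are hand-made assumptions standing in for a real morphological analyzer; only `fnmatch.fnmatchcase` is a real library call.

```python
from fnmatch import fnmatchcase

# Hypothetical lemma table: word form -> dictionary form (transliterated).
LEMMAS = {
    "bal": "bal", "bala": "bal", "balakh": "bal",                 # forms of "ball"
    "balnyi": "balnyi", "balnye": "balnyi", "balnymi": "balnyi",  # "ballroom"
    "balet": "balet",      # ballet
    "balyk": "balyk",      # cured fish
    "balkon": "balkon",    # balcony
    "balzak": "balzak",    # Balzac
    "baloven": "baloven",  # spoiled darling
}


def template_match(words: list[str], pattern: str) -> list[str]:
    """Surface matching with a wildcard template such as 'bal*'."""
    return [w for w in words if fnmatchcase(w, pattern)]


def lemma_match(words: list[str], lemma: str) -> list[str]:
    """Matching by dictionary form: all forms of one word, and nothing else."""
    return [w for w in words if LEMMAS.get(w) == lemma]


words = ["balakh", "balnymi", "balet", "balyk", "balkon", "balzak", "baloven", "tantsy"]

print(template_match(words, "bal*"))
print(lemma_match(words, "balnyi"))
```

The template catches seven of the eight words, most of them noise, while the lemma lookup returns only the form of "ballroom".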

Developed information retrieval languages allow the use of logical connectives: fool = NOT (smart), good fellow = (man) AND (young). In the longer term, they may make it possible to describe the meaning of a whole phrase in the information retrieval language (which is not always the sum of the meanings of its words) and to formulate correspondingly complex semantic queries.
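Logical connectives of this kind can be evaluated against a document's set of descriptors with a short recursive function. The tuple-based query notation below is an assumption chosen for brevity, not the syntax of any particular retrieval language.

```python
def matches(query, document: set[str]) -> bool:
    """Evaluate a query against a set of descriptors.

    A query is either a descriptor (str) or a tuple
    ("NOT", q), ("AND", q1, q2, ...) or ("OR", q1, q2, ...).
    """
    if isinstance(query, str):
        return query in document
    operator, *operands = query
    if operator == "NOT":
        return not matches(operands[0], document)
    if operator == "AND":
        return all(matches(q, document) for q in operands)
    if operator == "OR":
        return any(matches(q, document) for q in operands)
    raise ValueError(f"unknown operator: {operator}")


# The two definitions from the text, written in the hypothetical notation.
FOOL = ("NOT", "smart")
GOOD_FELLOW = ("AND", "man", "young")

ivan = {"man", "young"}
vasilisa = {"woman", "young", "smart"}

print(matches(GOOD_FELLOW, ivan))
print(matches(FOOL, vasilisa))
print(matches(("AND", GOOD_FELLOW, FOOL), ivan))
```

Because derived concepts such as `GOOD_FELLOW` are ordinary query values, they can be nested inside larger queries without extra machinery.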

Note that advertisements and reviews of search engines often use the words "indexing" or "indexation". There these terms mean building a common dictionary over the whole array to speed up the search. For the whole text base, a list of the terms occurring in it is compiled, and each term is associated with an index (its coordinates in the text base); most often this is the document number and the number of the word within the document. When a request arrives, the word is first looked up in this list, and the required documents are issued from the coordinates found. If the request contains several words, an intersection operation is performed on their coordinates. This is how the search for articles containing a given word is organized in the Windows help subsystem.
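Such a dictionary is commonly called an inverted index. A minimal sketch, assuming whitespace tokenization and coordinates of the form (document number, word number); the sample documents and function names are invented for the example.

```python
from collections import defaultdict


def build_index(docs: list[str]) -> dict[str, list[tuple[int, int]]]:
    """Map every term to its coordinates: (document number, word number)."""
    index: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for doc_no, text in enumerate(docs):
        for word_no, word in enumerate(text.lower().split()):
            index[word].append((doc_no, word_no))
    return dict(index)


def search(index: dict[str, list[tuple[int, int]]], request: str) -> set[int]:
    """Return the numbers of the documents that contain every request word."""
    words = request.lower().split()
    if not words:
        return set()
    result = None
    for word in words:
        found = {doc_no for doc_no, _ in index.get(word, [])}
        # Intersect the document numbers found for each word.
        result = found if result is None else result & found
    return result


docs = [
    "the prince likes ballroom dance",
    "ballroom for rent",
    "the prince likes hunting",
]
index = build_index(docs)

print(index["ballroom"])
print(search(index, "prince ballroom"))
```

The index is built once over the whole text base, so answering a request needs only dictionary lookups and set intersections rather than a scan of every document.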

The Internet is growing very quickly, so finding the right information among hundreds of billions of Web pages and hundreds of millions of files is becoming increasingly difficult. Information is found with the help of special search engines, which hold constantly updated information about the location of Web pages and files on hundreds of millions of Internet servers.

When searching for information, three questions have to be answered: what to look for (which sources of information), where to look (where these sources are located) and how to search (which tools to use).

What are the main sources of information presented on the Internet? These are WWW documents, articles in newsgroups and mailing lists, files in file libraries, directories of address information of organizations and people (e-mail, address, phone), articles in thematic databases, encyclopedias.

Where are these sources of information located? These are popular Internet resources such as WWW, newsgroups, mailing lists, and FTP servers.

Of course, the necessary sources can be sought manually: by finding addresses in specialized magazines on computing and the Internet, or by using printed directories with addresses classified by category.

However, in a space as changeable as the Internet, one needs to learn to use special tools whose purpose is to collect data on information resources and to provide users with a fast search service.

IPS (information retrieval system) - a system that searches for and selects the necessary data in a special database of descriptions of information sources (an index), on the basis of an information retrieval language and the corresponding search rules.

The main task of any IPS is to find information relevant to the user's information needs. It is very important that the search loses nothing, that is, finds all the documents related to the request, and finds nothing superfluous. A qualitative characteristic of the search procedure is therefore introduced: relevance.

Relevance is the correspondence of the search results to the formulated request.

Internet search engines can be divided into two groups:

- general-purpose search engines;

- specialized search engines.

General Purpose Search Engines

The interface of a general-purpose search engine contains a search field and a list of catalog sections. The following search tools for the WWW are distinguished: catalogs (directories), search engines and metasearch engines.


Catalog - a search system with a list of annotations, classified by topic, with links to web resources. Classification is usually carried out by people.

Searching in a catalog is very convenient and proceeds by successively narrowing the topic. Catalogs also support a quick keyword search for a specific category or page using a local search engine. The link database (index) of a catalog is usually limited in size and is filled in manually by the catalog staff. Some catalogs update the index automatically.

The search result in a catalog is presented as a list of brief descriptions (annotations) of documents, each with a hypertext link to the source.

Popular catalogs:

1 Foreign catalogs:

a) Yahoo;

b) Look Smart;

c) Magellan;

d) eNET.

2 Russian catalogs:

a) Aport (Internet Constellation);

b) AU;

c) Weblist;

d) Snail.

In the database of a search system, Web sites are grouped into hierarchical subject catalogs, analogous to the subject catalog of a library.

Top-level thematic sections, for example Internet, Computers, Science and Education, contain subcatalogs. For example, the Internet catalog may contain the subcatalogs Search, Mail and others.

Searching a catalog comes down to choosing a particular section, after which the user is shown a list of links to the Internet addresses of the most visited and informative Web sites. Each link is usually annotated, that is, accompanied by a short comment on the content of the document.
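Browsing a hierarchical catalog by successive refinement can be modeled with nested dictionaries. The sketch below is purely illustrative: the section names come from the text, while the `example.org` addresses and annotations are placeholders, not real catalog entries.

```python
# Hypothetical catalog: sections are dicts, leaf sections hold
# (address, annotation) pairs.
CATALOG = {
    "Internet": {
        "Search": [("https://search.example.org", "Hypothetical search site")],
        "Mail": [("https://mail.example.org", "Hypothetical mail service")],
    },
    "Computers": {},
    "Science and Education": {},
}


def browse(catalog: dict, path: list[str]):
    """Follow the chosen sections and return what the user sees next:
    subsection names for an inner section, annotated links for a leaf."""
    node = catalog
    for section in path:
        node = node[section]
    if isinstance(node, dict):
        return sorted(node)
    return node


print(browse(CATALOG, []))
print(browse(CATALOG, ["Internet"]))
print(browse(CATALOG, ["Internet", "Search"]))
```

Each call corresponds to one click in the catalog: the user narrows the topic step by step until a list of annotated links is reached.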

The most comprehensive multi-level hierarchical thematic catalog of Russian-language Internet resources belongs to the Aport search system. The catalog contains detailed annotations of the content of Web sites and indicates their geographical location.

Search engine

Search engine - a search system whose database of information about information resources is built by a robot.

The distinctive feature of search engines is that the database containing information about Web pages, Usenet articles and so on is built by a robot program.

A search in such a system is carried out on a request composed by the user, consisting of a set of keywords or of a phrase enclosed in quotation marks. The index is built and kept up to date by indexing robots. For example, to find search engines themselves on the Internet, one can enter the keywords "Russian Internet information search system".

Some time after the request is sent, the search engine returns a list of Internet addresses of documents in which the specified keywords were found. The description of a document most often contains its first few sentences or extracts from its text with the keywords highlighted. As a rule, the date the document was updated (checked) and its size in kilobytes are given; some systems also determine the language of the document and its encoding (for Russian-language documents).

To view a document in the browser, it is enough to follow the link pointing to it.

If the keywords were chosen poorly, the list of document addresses may be too long (it may contain tens or even hundreds of thousands of links). To shorten the list, one can enter additional keywords in the search field or use the catalog of the search system.

Many search engines allow searching within the documents already found, so the request can be refined by adding terms. If the system is sophisticated enough, it may offer a search for similar documents: the user picks a document that fits especially well and points it out to the system as a model. This function, however, often does not work as expected. Some search engines allow the results to be re-sorted. To save time, the search results can be saved as a file on a local drive for later study offline.

The most popular search engines abroad and in Russia:

1 Foreign search engines:

a) Google;

b) Alta Vista;

c) Excite;

d) HotBot;

e) Northern Light;

f) Go (Infoseek);

g) Lycos;

h) Fast.

2 Russian search engines:

a) Yandex;

b) Rambler;

c) Aport.

One of the most complete and powerful search engines is Google, whose database stores 8 billion Web pages and is extended by robot programs with 5 million new pages every month. In RuNet (the Russian part of the Internet), the search engines Yandex and Rambler have extensive databases of 200 million documents each.

Metasearch engine

Note that different search engines cover different numbers of information sources on the Internet, so one should not limit oneself to a single search engine. We now turn to search tools that do not build their own index but can use the capabilities of other search systems. These are metasearch systems (search services): systems that send a user's request to several search servers simultaneously, then combine the results and present them to the user as a document with links.

Metasearch systems have no database of their own. They are programs that accept a user's request, process it using artificial intelligence algorithms, and then query search engines; that is, they are search engines for search engines. The advantage of these systems is their ability to synthesize the purpose of the search rather than simply search according to the verbal query. The results of such a search are understandable to the user and match what he is looking for most closely. Metasearch sites offer a huge number of options, trying to be useful to any user. There are also versions of metasearch engines that constantly monitor the Internet for information matching your search criteria.

When such a system finds new information, it notifies you or downloads it automatically. If you want to find sites on general topics, travel and so on, metasearch engines give quick access to the necessary information. They also offer direct access to sites with specific information, such as telephone directories, travel guides and government sites. The response time of metasearch engines is usually somewhat longer, since they query other search systems. It makes sense to turn to them when conventional search engines have failed.
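The core of a metasearch system, sending one request to several engines and combining the answers, can be sketched without any network access. Everything here is hypothetical: the two "engines" are local stand-in functions returning made-up addresses, and the merging rule (more votes first, then best position) is a simple assumption, not the method of any real metasearch service.

```python
from typing import Callable

# An engine is modeled as a function from a request to a ranked list of addresses.
Engine = Callable[[str], list[str]]


def engine_a(request: str) -> list[str]:
    """Stand-in for a real search engine (returns fixed, made-up results)."""
    return ["http://a.example/1", "http://shared.example/x"]


def engine_b(request: str) -> list[str]:
    """Another stand-in engine with partly overlapping results."""
    return ["http://shared.example/x", "http://b.example/2"]


def metasearch(request: str, engines: list[Engine]) -> list[str]:
    """Query every engine and merge the answers without duplicates."""
    votes: dict[str, int] = {}
    best_rank: dict[str, int] = {}
    for engine in engines:
        for rank, url in enumerate(engine(request)):
            votes[url] = votes.get(url, 0) + 1
            best_rank[url] = min(best_rank.get(url, rank), rank)
    # Addresses returned by more engines come first, then by best position.
    return sorted(votes, key=lambda url: (-votes[url], best_rank[url], url))


print(metasearch("ballroom dance", [engine_a, engine_b]))
```

A real system would issue the requests concurrently, which is why the response time is bounded by the slowest engine rather than by the sum of all of them.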

Well-known metasearch systems:

- MetaCrawler;

- SavvySearch.

At present, most Russian state organizations and commercial firms lack an orderly system of records management, even though it is rational, clearly organized records management, the documentation support of the organization's management, that can significantly increase the efficiency of an enterprise. The organization of work with documents is an important part of the processes of management and managerial decision-making and significantly affects the efficiency and quality of management. The process of making a managerial decision consists of obtaining information, processing it, analyzing it, and preparing and making the decision. All these stages are closely tied to searching for documents and for information about them. This is the role of the information retrieval system (IPS) for an organization's documents. The reliability and quality of management depend on the quality, reliability and speed of receiving and transmitting information, on a properly set up reference and information service, and on a clear organization of the search, storage and use of documents. The main objectives of management documentation are to reduce information flows to an optimal minimum and to simplify and cheapen the collection, processing and transmission of information by using the latest technology to automate these processes. It is therefore vital for any organization to keep improving its management documentation, since this directly affects the quality of management decisions. The tasks of the IPS comprise several stages, which are discussed in this coursework.

Purpose of the work: to become acquainted with the information retrieval systems of an organization as an object of study, and to establish the goals, characteristics and scope of use of this object.

Research Objectives:

1. Describe the types of information retrieval systems

2. Registration of documents as the basis of information retrieval system

3. Other sources for building IPS

4. Technology

document management information management

1. Types of Information Retrieval Systems

Organizations create manual, mechanized and automated IPSs. An IPS includes the registration and indexing of documents, the information retrieval arrays created on their basis (card files, arrays on machine-readable media), and the operational storage of documents.

To make the search arrays of the organizations of one industry informationally compatible, classifiers must be developed centrally: a standard nomenclature of files; a classifier of correspondents; a classifier of structural units (where standard structures exist); a classifier of the names of document types; a classifier of issues in the organization's activity; a classifier of issues raised in citizens' proposals, applications and complaints; and so on.

Intersectoral information compatibility of IPSs is ensured by the use of OK TEI; when intrasystem classifiers are used, it must be possible to convert to OK TEI codes. Several independent types of IPS are distinguished:

1 reference or reference-and-control systems, which track the progress of documents or their execution. The basis of systematization in these arrays is, as a rule, the date (deadline) of execution;

2 reference systems for documents of limited access, kept, as a rule, in numbered, stitched and sealed journals;

3 reference systems for citizens' proposals, applications and complaints, in which the basis of systematization is the subject matter of the issues raised in citizens' appeals;

4 reference (codification) systems for normative legal acts that reflect the organization's activity (its legal environment). In systems of this type, each issue on which the document contains information is recorded separately, and the subject matter of the regulatory provisions is likewise the basis of systematization. When documents are withdrawn from circulation or repealed, the information in the IPS is cancelled but not destroyed, and is transferred to the organization's archive together with the documents.

Reference card files are divided into two parts, for unexecuted and for executed documents. The registration and control cards (RKK) in them are systematized according to the following criteria:

1. subject-matter or thematic (in accordance with the content of the documents or the field of activity to which the documents relate);

2. according to the nomenclature of cases (in accordance with the names of cases in the nomenclature of cases or their indices);

3. correspondent (according to the names or symbols of the organizations with which correspondence is conducted);

4. by executors (by structural units);

5. alphabetical (in alphabetical order of surnames or of the names of objects or items);

6. geographical (according to the names of administrative-territorial units);

7. nominal (by the names of types or varieties of documents);

8. registration (in order of increasing registration numbers of documents).

The choice of a search attribute is determined depending on the types of documents and the nature of information requests.

The first part of the card file is used to search for information about documents while they are being executed. The second part is used to search for executed documents.

As documents are executed, the RKK with the necessary marks are moved from the first part of the card file to the appropriate sections and headings of the second.
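The two-part card file and the choice of a search attribute described above can be sketched in a few lines of Python. This is only an illustration: the `Card` fields, the class names and the sample values are assumptions made for the example, not a prescribed card layout.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    """A simplified registration and control card (RKK); fields are illustrative."""
    reg_number: int
    correspondent: str
    case_index: str
    executed: bool = False


@dataclass
class CardFile:
    """A reference card file with two parts: unexecuted and executed documents."""
    unexecuted: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def add(self, card):
        # New cards always start in the first (unexecuted) part.
        self.unexecuted.append(card)

    def mark_executed(self, reg_number):
        """Move the card with this number from the first part to the second."""
        for card in self.unexecuted:
            if card.reg_number == reg_number:
                self.unexecuted.remove(card)
                card.executed = True
                self.executed.append(card)
                return card
        raise KeyError(reg_number)

    def systematize(self, part, attribute):
        """Return the cards of one part ordered by the chosen search attribute."""
        cards = self.unexecuted if part == "unexecuted" else self.executed
        return sorted(cards, key=lambda c: getattr(c, attribute))
```

Calling `systematize("executed", "case_index")` would order the second part by the nomenclature of cases, while `systematize("unexecuted", "correspondent")` would order the first part by correspondent.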

Depending on the volume of workflow, the system of registration and control over the execution of documents, and the search tasks, a single reference IPS or several independent ones can be maintained. Separate card files (databases) are formed for incoming documents, initiative outgoing documents and citizens' appeals. Where the organization uses a large number of regulatory legal acts and administrative documents, separate codification card files (databases) can be created for them.

The list of database names is similar to the list of card index names.

Accounting for the volume of workflow:

1. The volume of workflow is the number of documents received (incoming) and created (internal, outgoing) by the organization over a certain period of time;

2. The number of documents is counted from the registration forms at the places where the documents are registered.

One copy of the document is taken as the accounting unit, excluding copies created during printing and reproduction. Each document is counted once. The appendices to the document are taken into account together with it as one document.

Documents received by the organization, documents created by it, and citizens' appeals are counted separately.

Duplicated copies are recorded separately, according to the work logs of the typing and copying bureaus and (or) the mailing lists.
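The counting rules above (one unit per document, each document counted once, appendices not counted separately, categories and duplicated copies kept apart) can be sketched as follows. The record keys `reg_id`, `category` and `is_duplicate_copy` are hypothetical names chosen for this example.

```python
from collections import Counter


def workflow_volume(records):
    """Count documents by category, one unit per registered document.

    Each record is a dict with hypothetical keys: 'reg_id', 'category'
    ('incoming', 'created' or 'appeal') and optional 'is_duplicate_copy'.
    """
    seen = set()
    volume = Counter()
    duplicated_copies = 0
    for rec in records:
        if rec.get("is_duplicate_copy"):
            duplicated_copies += 1       # duplicated copies are recorded separately
            continue
        key = (rec["category"], rec["reg_id"])
        if key in seen:                  # each document is counted once
            continue
        seen.add(key)
        volume[rec["category"]] += 1     # appendices do not add extra units
    return dict(volume), duplicated_copies
```

The function returns the per-category totals together with the separate count of duplicated copies, so the two figures are never mixed.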

The organization can carry out full or selective accounting of the volume of workflow (for the organization as a whole, for structural units, for groups of documents, etc.).

Accounting and analysis of the volume of workflow in the organization are carried out under the supervision of the DOW service.

The results of accounting for the volume of workflow are summarized by the DOW service and presented to the organization's management so that measures to improve work with documents can be developed.

Information on the volume of workflow is used to establish the structure and staffing of the DOW service, to select the technology for working with documents and the office automation tools, and to determine the workload of the DOW service and of individual employees.

Information retrieval systems play a significant role in solving the most important tasks of archival institutions: intensifying archival heuristic processes and increasing the speed and effectiveness of searches on all topics and sets of documents, at all levels of search; expanding user access to document information (access is often restricted not by a confidentiality stamp but by the insufficient quality of the scientific reference apparatus, which significantly complicates the work of researchers); increasing the intensity and efficiency of the use of archival documents in all its forms and the variety of information services provided by archives, including on a contractual basis; and developing inter-archival and international cooperation based on information exchange and on joint projects that bring significant bodies of historical sources into scholarly circulation.

The theory of information retrieval began with the study of documentary information retrieval systems (IPS). Information retrieval in such systems is understood as a sequence of operations performed in order to find documents (articles, scientific and technical reports, descriptions of copyright certificates and patents, books, etc.) containing certain information (followed by the issue of the documents themselves or their copies), or in order to issue factual data that answer the questions asked.

2. Registration of documents - the basis for the construction of IPS

Registration of documents is the recording of the fact that a document has been created or received, by affixing an index to it and then entering the necessary information about the document in registration forms.

The document index consists of a serial number within the registered document array, which, depending on the search tasks, is supplemented by indexes from the nomenclature of cases, the classifier of correspondents, the classifier of executors, etc. The component parts of the document index follow this sequence (or its reverse): serial registration number, index from the nomenclature of cases, index from the classifier used. The component parts of the index are separated from each other by a slash.
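The composition rule for a document index can be illustrated with a small helper. The function name and the sample values are assumptions for the example; only the ordering of the parts and the slash separator come from the text.

```python
def build_document_index(serial_number, case_index=None, classifier_index=None, reverse=False):
    """Join the parts of a document index with a slash.

    Order: serial registration number, index from the nomenclature of cases,
    index from the classifier used (or the reverse order if requested).
    """
    parts = [str(serial_number)]
    for part in (case_index, classifier_index):
        if part:                      # supplementary indexes are optional
            parts.append(str(part))
    if reverse:
        parts.reverse()
    return "/".join(parts)


print(build_document_index(127, "01-05", "K12"))                # 127/01-05/K12
print(build_document_index(127, "01-05", "K12", reverse=True))  # K12/01-05/127
```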

All documents that require accounting, execution and use for reference purposes are subject to registration (administrative, planning, reporting, statistical, accounting, financial, etc.), both those created and used within the organization and those sent to other organizations or received from higher, subordinate and other organizations and from individuals. Both traditional typewritten (handwritten) documents and those created with computer technology (machine-readable documents, machine printouts) are subject to registration.

Documents are registered in the organization once: incoming documents on the day of receipt, created documents on the day of signing or approval. When a registered document is transferred from one unit to another, it is not registered again. Documents are registered within groups that depend on the name of the document type, its author and its content. For example, the following are registered separately: the head's orders on core activities, the head's orders on personnel, the orders of the head of the higher organization on core activities, the decisions of the collegium of the higher organization, audit reports on financial and economic activities, accounting reports, reports of enterprises, acts on the implementation of the results of scientific developments, work plans of subordinate organizations, applications for logistics, etc. Serial registration numbers are assigned to documents within each registered group.

Incoming and created documents are registered, as a rule, in a decentralized manner, at the places where documents are created and executed. For example, planning documentation is registered in the planning department, procurement documents in the procurement department, and minutes and decisions of the board in the secretariat; administrative documents on core activities signed by the head of the organization, and documents received from higher organizations or sent to them, are registered in the documentation support service; and so on. The place of registration of a document is fixed in the instructions on the documentation support of management and in the organization's sheet of documents. Documents are registered on cards.
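Two rules from this passage, single registration and serial numbering within each registered group, can be sketched as a small registry. The class, the document identifiers and the group names are hypothetical and serve only to show the logic.

```python
from collections import defaultdict


class Registry:
    """Assigns serial numbers within each registered group, once per document."""

    def __init__(self):
        self._next = defaultdict(lambda: 1)   # next serial number for each group
        self._numbers = {}                    # document id -> (group, number)

    def register(self, doc_id, group):
        if doc_id in self._numbers:
            # A registered document is never registered again,
            # even when it moves to another unit.
            return self._numbers[doc_id]
        number = self._next[group]
        self._next[group] += 1
        self._numbers[doc_id] = (group, number)
        return group, number
```

Registering the same document a second time simply returns the number it already holds, which mirrors the rule that a transferred document keeps its original registration.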
To achieve information compatibility of registration data and to create conditions for the transition to automated registration, a mandatory set of registration details is defined: author (correspondent); name of the document type; date of the document; document index (date and index of receipt of the document); title of the document or its summary; resolution (executor, content of the instruction, author, date); deadline; mark on execution (a brief record of how the matter was resolved, the date of actual execution and the index of the response document; the case number). If necessary, the mandatory details can be supplemented with others: executors, the executor's signature confirming receipt of the document, progress of execution, a note on appendices, etc. The organization itself determines the order of the details on registration forms and the use of the reverse side of registration and control cards. The number of copies of registration and control cards is determined by the number of reference and control card files in the structural units where the document will be executed and monitored.

Response documents are registered on the registration forms of the initiative documents. A response document is assigned its own serial registration number within the corresponding registered array. Proposals, applications and complaints of citizens are registered on registration and control cards of the established form. Photocopying equipment and computer technology can be used to register documents. In an automated IPS, documents are registered using a machine-oriented registration and control card built on the mandatory registration details, or by entering the details directly from the document.

Reference card files are formed from registration and control cards, depending on the tasks of the specific IPS and in accordance with the classifiers used. The following card files are compiled:


reference and reference-and-control;

geography of incoming documents;

on proposals, statements and complaints of citizens;

thematic (orders, decisions), etc.

Information retrieval arrays of automated IPS are formed on the basis of information about documents recorded on computer media.
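As a rough illustration of a machine-oriented registration and control card, the mandatory details listed earlier can be modeled as a record. The field names, the use of text dates and the choice of which details must be present at the moment of registration are assumptions of this sketch, not requirements stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegistrationCard:
    # Mandatory registration details named in the text
    author: str                              # author (correspondent)
    document_type: str                       # name of the type of document
    document_date: str                       # date of the document (ISO text, assumed)
    document_index: str                      # document index
    title: str                               # title or summary
    resolution: Optional[str] = None         # executor, instruction, author, date
    deadline: Optional[str] = None
    execution_mark: Optional[str] = None     # filled in when the document is executed
    # Additional details an organization may add
    executors: Optional[str] = None
    progress: Optional[str] = None


# Assumption: these details must be known when the card is first created.
REQUIRED_AT_REGISTRATION = ("author", "document_type", "document_date",
                            "document_index", "title")


def missing_details(card):
    """Return the names of required details that are still empty."""
    return [name for name in REQUIRED_AT_REGISTRATION if not getattr(card, name)]
```

A check such as `missing_details(card)` before saving a card is one simple way to keep records with empty details out of the search array.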

Types of registration accounting forms:

In practice, various registration forms are used: journal, card and computer. The journal form of registration is used by organizations and structural divisions whose document flow is less than 500-600 documents per year. All sheets of a journal are numbered in the upper right corner, starting from the second. All sheets are stitched with durable thread. The ends of the thread are brought out onto the reverse side of the last numbered sheet and glued down at their midpoints with a paper square larger than the diameter of the round seal. The organization's seal is placed on top, either overlapping part of the square or directly in its middle. On the same page an inscription certifying the correct make-up of the journal is written, for example: “This journal contains 90 (ninety) numbered, laced and sealed sheets.” The inscription is certified by the personal signature of the clerk (secretary), with the position and the printed name (surname, initials). After the entries in the journal are finished, a closing note is written, for example: “372 documents, No. 1 to No. 372, are registered in this journal. No. 41 was accidentally skipped.” The note is certified by a signature. All journals have numbers according to the nomenclature of cases. The cover of the journal should bear the name of the journal, the name of the organization, the registration number according to the nomenclature of cases and, in the lower right corner, “Started 00.00.00” and “Finished 00.00.00”. The start and end dates of the journal are set by the registration of the first and last documents in it (within the calendar year). The journal form of registration has its technological disadvantages:

The formal nature of assigning a document a gross serial number;

The difficulty of doing search, reference and control work from a journal;

The forced multiple registration of documents;

The inability to reflect the movement of a document in the course of its consideration and execution.

A single registration and control card is used to register documents. Cards are printed on A5 paper (148x210 mm). For faster orientation, categories of documents can be distinguished by color: cards are made with colored stripes along the top margin, or the whole card is colored, for example green for incoming documents, blue for outgoing and yellow for internal. The registration fields on the card, in the journal and in the computer form do not differ significantly.

The organization's paperwork instructions establish:

The location of details on the card;

The areas for making entries;

The procedure for using the reverse side of the card;

The composition of additional details (determined by the organization);

The number of copies of cards to be filled in (determined by the number of reference and reference-and-control card files kept by offices and structural units).

Reference card file

The reference card file consists of 2 parts and contains:

Cards for unexecuted documents, incl. sent to structural units for familiarization and study;

Cards for executed documents.

The card file of unexecuted documents includes the following groups of cards:

For received documents;

For documents systematized by correspondents;

For documents of internal use, grouped by structural units, managers, specialists;

For documents combined by the date of receipt.

The part of the card file for executed documents is formed according to the scheme of the nomenclature of cases or according to the organization's lines of activity. Inside the sections, the cards are arranged by date of receipt. The section of the reference card file for executed proposals, applications and complaints is arranged in alphabetical order of the applicants' surnames.

Working with card files involves:

Timely receipt of cards for newly registered documents;

Regular filing operations;

Entering additional information into the cards (resolutions, marks on the transfer to specialists, performance, etc.);

Making changes to records;

Moving cards from one section of the card file to another.

With proper management of the card files, reference and search work is carried out promptly and efficiently.

The systems described above fully implement the new opportunities for working with electronic documents. Despite their different purposes, they work with documents and perform all the necessary operations in accordance with the regulatory documents on which they are based. A properly organized workflow allows a company to introduce these systems at minimal cost. As a result, all normative and current documentation and the order in which documents move are preserved without loss. It also becomes possible to bring into the document management system new types of documents that previously had no place in the nomenclature of cases: emails, film and photo materials, scanned documents, electronic catalogs, partner and customer sites, chat recordings, e-books and other electronic information. This increases the volume of workflow and the disk space occupied, and reduces the speed of information processing. These shortcomings are objective, but they will not cause difficulties in the company's work if the nomenclature of cases is competently compiled, the types of documents, their details and attributes are approved, and storage periods are set. There is another reason to do this work. Without it, the new opportunities for quickly obtaining complete information, processing documentation in groups, using the database as a source for creating a document, and so on will be reduced to zero. The result may be even worse: a growing flow of documents entered into the system without details and attributes will, over time, lead to complete chaos. Searching a database for information that has no details and attributes becomes an art available to few. So far, all hopes of reducing paper documentation with the help of computers have been fruitless.
This is a natural result of storing documents in the file structure of the operating system. Practically none of the files stored there are electronic documents; they are a pile of drafts prepared for printing, and since copying, scanning, templates and network sharing have increased the speed of preparing drafts by an order of magnitude, ten times more paper is now consumed. Modern systems can change the situation radically. All originals are stored in the system, protected by access rights, digital signatures and a procedure for making changes, and copies are printed only when strictly necessary. Particularly noteworthy are the new capabilities not only for quickly finding the desired document but also for analyzing incoming information: monitoring the execution of instructions given in orders, directives and at meetings, and automatically notifying all interested parties of an approaching deadline or the start of a meeting. The details and attributes of documents make it possible to do work that was previously impossible because of the sheer number of documents. All this gives the manager more complete information for making a decision and lets documents be submitted as they become ready. Managing documents by attributes allows them to be sorted and grouped by importance, subject and unit. It is always possible to see the status of a document by the stages of its preparation and to make a timely decision on allocating resources. The role of the secretary and clerk acquires elements of creativity: the tasks set can be solved in the system in different ways, and the result may exceed expectations. The analysis capabilities are extensive and can serve the interests and needs of literally every employee of the company. A product life cycle management system is built for technical workflow, but its capabilities more than cover the needs of conventional workflow.
This makes it possible to bring all internal documentation together, which is very important for reducing the cost of maintaining the archive. The concept of a single information space did not develop immediately. It began with the introduction of process management. When the task is to support the process from order to production, any delay along the way brings losses. The need to provide information on the main and auxiliary processes of the enterprise gave rise to requirements for a single information space and an enterprise management system. The main principles are these: information must be reliable and relevant; information must be necessary and sufficient; information must not be duplicated. If your company is implementing or has already implemented information systems, knowing the basics of a single information space will spare your health and nerves. Consider how these principles work. There seems to be no problem with the reliability of information, but in fact there is. The Russian and English alphabets contain many letters that look the same. If such letters are confused during data entry, the error is very difficult to detect: when searching, the system treats the misspelled word as a new word, and the needed information is lost or duplicated. It is very important to use reliable materials when adding new entries to the system, and reliability must be checked before the data is entered. Over time information loses relevance, so its state must be monitored to keep it relevant and reliable. For example, the attributes “priority” and “importance” require constant attention, whereas the attribute “name” does not change and does not require it. Necessity and sufficiency is also a very important principle. If you fill the system with documents of doubtful need, you overload it with garbage. On the other hand, the information must be sufficient for the company to work.
For successful work, information must be checked for necessity and sufficiency. Over time this will create an excellent archive as the company's intellectual property. Avoiding duplication of information is more difficult. The system itself must be protected against duplication, but employees must also understand their responsibility. Above all, the regulatory information used by the company must have no duplicates. Duplicates lead to serious errors when changes are made. The typical situation: one copy was corrected, another was forgotten, and nobody knew that a third existed. If the work is organized in a way that invites creating duplicates, reorganize the work. When you select previously entered information in the system, check that it is the only instance. If duplicates are found, they must be blocked or deleted. Modern systems have protection against duplicates, but it does not work if the duplicate is entered with an error, and this is very dangerous later on. A distinction should be made between a duplicate of a document and a duplicate link to a document. The first increases the disk space occupied; the second helps to organize the workplace in the system. There can be many links to a document in the system, but all of them point to one document. When you change the document, everyone who opens it through their own link sees your changes. The information space is unified if all employees of the company use common reference data, a single documentation archive and a single regulatory framework in their work. But this is not enough: the other programs in your company should use the same data. Duplication must not be allowed. The question arises: if the benefits of the new opportunities are so evident, why is it so difficult to introduce new technologies? There are natural reasons for this. Management methods have been imported for many years, but only now are they beginning to be used in practice.
Equipment and software, by contrast, were put to use in production immediately, ahead of the introduction of new methods. The result is that 80% of implementations fail. But attempts to create a single information space continue. It is now seen as an internal competitive advantage, which consists in the speed at which documents pass, in their reliability, in a larger volume processed in the same time by the same employees, and in the possibility of analysis and forecasting. The speed and quality of information processing have become a commodity. In the future, such systems will exist everywhere.
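The problem of Russian and English letters that look alike, and of duplicates that slip past protection because they were entered with such an error, can be illustrated with a short sketch. The lookalike table below is a small illustrative subset, and folding names to a common key is an assumption of this example rather than a description of how any particular system works.

```python
# Cyrillic letters that look like Latin ones (an illustrative subset).
LOOKALIKES = str.maketrans("АВЕКМНОРСТХаеорсху", "ABEKMHOPCTXaeopcxy")


def is_cyrillic(ch):
    return "\u0400" <= ch <= "\u04FF"


def is_latin(ch):
    return ch.isascii() and ch.isalpha()


def has_mixed_scripts(word):
    """True if a single word mixes Cyrillic and Latin letters."""
    return any(is_cyrillic(c) for c in word) and any(is_latin(c) for c in word)


def suspicious_words(text):
    """Words worth checking before the entry is saved."""
    return [w for w in text.split() if has_mixed_scripts(w)]


def duplicate_groups(names):
    """Group names that become identical once lookalike letters are folded."""
    groups = {}
    for name in names:
        key = name.translate(LOOKALIKES).casefold()
        groups.setdefault(key, []).append(name)
    return [g for g in groups.values() if len(g) > 1]


# "\u041e" is the Cyrillic capital letter that looks like Latin "O".
print(suspicious_words("\u041erder No 5"))
print(duplicate_groups(["Order", "\u041erder", "Report"]))
```

A check of this kind flags candidates for human review; it cannot decide which spelling is the correct one.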

3. Other sources for building IPS

The nomenclature of cases must include all IPS for documents and publications. Depending on operational need and the degree of access restriction, documents may be grouped into cases separately or together with other documents on the same issue.

When an organization produces a large number of documents and files of the same type (orders, instructions, summaries, etc.) both with and without an access restriction mark, it is advisable to form them into separate cases. In that event, the access restriction mark is added to the case number in the “Case Index” column of the nomenclature.

When a document of limited access is included in a case of unrestricted documents, the case as a whole receives an access restriction mark, and a corresponding clarification is made in the organization's (institution's) nomenclature of cases. In organizations whose activities generate only a small number of documents of limited access, the nomenclature of cases may provide for a single case, for example: “Documents ‘For official use’”. No storage period is set for such a case, and the corresponding column of the nomenclature of cases is marked “EC” (expert commission). At the end of the record-keeping year, the organization's expert commission reviews the limited access case sheet by sheet and, if necessary, decides to regroup the documents. Documents of long-term and permanent storage contained in the case are grouped into a separate case, which receives its own heading and is additionally included in the nomenclature of cases. If the case contains only documents with temporary storage periods, it need not be regrouped; its storage period is set by the longest period among the documents it contains. Cases of open record keeping in which restricted information accumulates as materials are added should also be assigned to the limited access category. An access restriction mark is likewise affixed to the covers of these cases, and corresponding clarifications are made in the nomenclature of cases.

Each case included in the nomenclature must have an index. The case index consists of the index of the structural unit (according to the classifier of structural units) and the serial number of the case within the unit. If a case has several volumes (parts), the index is affixed to each volume with the addition of “t. 1”, “t. 2”, etc.

The organization's nomenclature of cases is compiled in the established form and printed in the required number of copies. The first copy is kept in the management documentation support (DOW) service, the second is used as a working copy in the DOW service, and the third is used in the departmental archive with which the nomenclature was agreed and which receives the documents for permanent storage. The nomenclature of cases is recompiled and re-agreed whenever the functions and structure of the organization change fundamentally, but at least once every five years. It is refined annually, approved by the organization's management and put into effect on January 1 of the following year. During the year, information on the opening of cases, the inclusion of new cases, etc. is entered into the approved nomenclature. At the end of the year, a final record of the categories and number of cases opened is entered into it. The final record for the organization is compiled from the final records of the nomenclatures of the structural units.

Cases are prepared when they are opened and at the end of the year. The preparation of cases comprises a set of operations for their technical processing and is carried out by employees of the respective structural units with the methodological assistance and under the supervision of the departmental archive.
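The rule for forming a case index (unit index plus serial number, with a volume suffix where needed) can be shown with a short helper. The hyphen separator, the two-digit padding and the sample values are assumptions of this sketch; the text fixes only the two components and the “t. N” addition.

```python
def case_index(unit_index, serial_number, volume=None):
    """Build a case index from the structural unit index and the case serial number."""
    index = f"{unit_index}-{serial_number:02d}"   # separator and padding are assumed
    if volume is not None:
        index += f", t. {volume}"                 # volume (part) number of the case
    return index


print(case_index("01", 5))             # 01-05
print(case_index("01", 5, volume=2))   # 01-05, t. 2
```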
Properly prepared cases help to build a good archival fund of the region or territory. Archive collections have been enacted in some regions. Depending on the storage period, cases are prepared fully or partially. Cases of permanent and long-term storage and personnel cases are subject to full preparation, which includes filing or binding the case, numbering its sheets, compiling the certification sheet of the case, compiling an internal inventory of documents where necessary, and filling in the details on the case cover. Cases of temporary storage (up to 10 years inclusive) are not bound and their sheets are not numbered; the documents are kept in folders. The covers of cases of permanent and long-term storage are made out in the prescribed form. At the end of the year, the inscriptions on these covers are refined: the correspondence between the case headings on the cover and the content of the filed documents is checked and, if necessary, additional information (numbers of orders and protocols, types and forms of reporting, etc.) is entered in the heading. The date on the cover should correspond to the years in which the case was opened and closed. If a case contains documents from years earlier than the year in which it was opened, a note is made under the date: “Contains documents for ... years.” On the covers of cases consisting of several volumes (parts), the end dates of the documents of each volume (part) are given. An exact calendar date is given as day, month and year, or as year, month and day. The day and year are written in Arabic numerals, and the name of the month is written in words. An abbreviated numeric form of the date is allowed, provided that it does not make the date ambiguous.
By agreement with the departmental archive, the case number according to the inventory, the inventory number and the fund number may be written on the cover in pencil.

If the name of the organization (or its structural unit) changes during the period covered by the documents of the case, or if the case is transferred to another organization (structural unit), the new name of the organization (structural unit) is added on the cover. The inscriptions on the covers of cases of permanent and long-term storage should be made clearly, in black lightfast ink. Inventories are compiled annually for completed cases of permanent and long-term storage and for personnel cases that have passed an examination of value and been prepared in accordance with the above requirements. Inventories are not compiled for cases of temporary storage. Inventories of structural divisions are compiled by the employees responsible for documentation, under the direct methodological guidance of the departmental archive. Separate inventories are compiled for cases of permanent storage, for cases of long-term storage, for personnel cases and for other specific categories of cases (scientific reports on topics, judicial and investigative cases, rationalization proposals, etc.).

In organizations where each structural unit produces a large volume of documents, inventories of cases of permanent storage are compiled annually by each unit under the direct methodological guidance of the departmental archive. The inventories prepared by structural divisions serve as the basis for the organization's consolidated inventory, which the departmental archive prepares and under which it transfers the cases to state storage. Inventories of personnel files may be compiled for several years with continuous numbering of cases. Inventories of cases are compiled in the prescribed form in two or three copies. When cases are transferred from a structural unit to the departmental archive, a note confirming the presence of the case is made against each case in all copies of the inventory. At the end of each copy of the inventory, the number of cases actually transferred (accepted) is stated in figures and in words, the employees who handed over and received the cases sign, and the date is given. When particularly valuable cases are transferred, the number of sheets in the cases is checked.

Along with the cases, the registration and control cards for documents, registration books and other registration forms are transferred from structural divisions to the departmental archive; they are included in the inventory as separate storage units.

Reference file cabinets, information retrieval arrays on machine carriers.

Reference card files are formed from registration and control cards, depending on the tasks of the specific IPS and in accordance with the classifiers used. The following card files are compiled: reference files; card files on proposals, statements and complaints of citizens; thematic (codification) card files for departmental standards, orders, decisions, etc. The information retrieval arrays of automated IPS are formed from information about documents recorded on computer media.

4. IPS technologies

The basis for building an IPS is the classification of documents and document information, together with the classification guides (classifiers) developed from it. To enter registration indicators into the information retrieval system, classifiers are most often developed and applied that make it possible to enter the values of these indicators quickly and to ensure that they are interpreted unambiguously.

These include the following classifiers:

The classifier of types of documents lists the types of documents used in the organization, both created and received. This classifier is used when filling out the "Name of the document" field of the registration and control card (RKK). The names of the types of documents must comply with the terminology used in the All-Russian Classifier of Management Documentation (OKUD), in the State Social Audit Office, and in industry normative and methodological documents. Based on this classifier, the organization develops various lists: "List of documents with deadlines"; "List of documents not subject to registration"; "Sheet of documents of the organization and its structural divisions"; "List of cases indicating the shelf life"; etc.

The classifier of correspondents includes a list of the names of organizations, institutions, enterprises and other permanent correspondents (firms, individuals) with which the organization deals in the course of its activities. This classifier is based on the All-Russian Classifier of Enterprises and Organizations (OKPO). The correspondent classifier is used to fill in the RKK field "Author (correspondent)".

The classifier of questions of the organization's activities includes a list of these questions correlated with levels of competence, i.e. it also records the distribution of responsibilities accepted in the organization. This classifier is used to fill in the RKK field "Document Title or Summary". It is also used during the preliminary review of documents, when the route of a document is determined.

The classifier of structural divisions of an organization is developed on the basis of organizational documents: the charter (regulation) of the organization, the approved structure and staffing table, and the regulations on structural divisions. An industry often has a document defining a typical structure for a group of homogeneous organizations (schools, universities, social protection departments, social assistance centers). In the classifier of structural divisions, the document management support service is always listed first, and the structural divisions then follow in the order in which they are given in the approved structure. The indices of structural divisions are used when assigning a registration number to a document and when filling out such RKK fields as "Document Number and Date" (for internal and outgoing documents) and "Number and Date of receipt of document" (for incoming documents).

The classifier of structural units is also used to fill out the RKK field "Resolution".

Classifier of performers. As a rule, this classifier includes the names of deputy heads, heads of structural divisions responsible for the execution of documents, which are listed in the same sequence as in the classifier of structural divisions.
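As a rough illustration of how classifiers make the values entered into a registration and control card unambiguous, the sketch below checks card fields against classifier contents. The classifier entries, the field names and the function validate_card are all hypothetical and are not taken from any real system.

```python
from dataclasses import dataclass

# Hypothetical classifier contents, invented for illustration only.
DOCUMENT_TYPES = {"order", "letter", "protocol"}
CORRESPONDENTS = {"City Archive", "Regional Department of Education"}
STRUCTURAL_DIVISIONS = {"01": "Document management service", "02": "Accounting"}


@dataclass
class RegistrationCard:
    document_type: str   # filled from the classifier of types of documents
    correspondent: str   # filled from the classifier of correspondents
    division_index: str  # filled from the classifier of structural divisions


def validate_card(card):
    """Return the names of fields whose values are not found in the classifiers."""
    errors = []
    if card.document_type not in DOCUMENT_TYPES:
        errors.append("document_type")
    if card.correspondent not in CORRESPONDENTS:
        errors.append("correspondent")
    if card.division_index not in STRUCTURAL_DIVISIONS:
        errors.append("division_index")
    return errors
```

A card whose values all come from the classifiers yields an empty list; any other value is reported by field name, so it can be corrected before the card enters the search array.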


To achieve information compatibility between the search arrays of the organizations of an industry, classifiers need to be developed centrally: a typical nomenclature of cases, a classifier of correspondents, a classifier of structural units (if there are typical structures), a classifier of names of types of documents, a classifier of aspects of the organization's activities, a classifier of issues contained in proposals, statements and complaints of citizens, etc. Without a well-developed IPS, effective document management is impossible. The IPS is an inseparable part of the workflow system. In the modern world, information retrieval systems are more in demand than ever. In the age of electronic technology, the task of finding documents can be greatly simplified by automating it. All large organizations have carried out this task or are carrying it out, because their document flow is very large, and effective search greatly facilitates work and saves time. This course work has given an example of how an IPS operates in an organization, together with its composition and stages of work.

Chapter 1. Information retrieval systems

1 The concept of information retrieval systems

2 History of IPS

3 IPS structure

4 Types of IPS

Chapter 2. Modern information retrieval systems

1 Scopes of use of modern IPS

2 Architecture of modern IPS

3 Popular IPS



Relevance. The current stage of the development of civilization is characterized by the transition of the most developed part of humanity from an industrial society to an information society. One of the most striking phenomena of this process is the emergence and development of a global computer information network.

The problem of searching for and collecting information is one of the most important problems that information retrieval systems address. The situation today cannot be compared with, say, the Middle Ages, when searching was a problem because information was scarce and effort was needed to find anything at all on a question of any significance. Later it became possible to go to a library and, after spending some time choosing the right book from the catalog, find the necessary information. But catalogs do not completely solve the problem of information retrieval even within a single library, since a catalog entry contains relatively little information: title, author, place of publication. The problem of finding information took on a new character in the 20th century, with the beginning of the age of information technology. The difficulty is now not that information is scarce and therefore hard to find, but, on the contrary, that there is more and more of it, so finding the answer to a question of interest can again be quite difficult. The problem becomes much more complicated when virtual sources are used. Online catalog technology lets the user search the catalogs of several libraries at once, which complicates the task even further but, on the other hand, increases the chances of solving it.

At the present stage, the entire information space in which we live is more and more immersed in the Internet. The Internet is becoming the main form of existence of information without canceling the traditional ones, such as magazines, radio, television, telephone, and various reference services.

The aim of this work is to study automated information retrieval systems.

Tasks. This course work considers the theoretical foundations of automated information retrieval and the classification and varieties of information retrieval systems. It also analyzes material on the information retrieval directories and the full-text and hypertext search engines currently in use.

With the advent of the Internet, the search problem became more pressing. The Internet is a worldwide computer network that forms a single information environment and allows information to be received at any time. On the other hand, although a great deal of useful information is stored on the Internet, finding it takes a lot of time. This problem led to the emergence of search engines. This course work considers search engines on the Internet.

Chapter 1. Information retrieval systems

1 The concept of information retrieval systems

The search for information is a task that mankind has been solving for many centuries. As the volume of information resources potentially accessible to one person (for example, a library visitor) grows, ever more sophisticated search tools and techniques are developed for finding the necessary document.

Automated search system - a system consisting of personnel and a set of tools for automating their activities, which implements an information technology for performing established functions.

The experience and practice of creating systems in various fields of activity allows us to give a broader and universal definition that more fully reflects all aspects of their essence.

An information retrieval system is a system that provides the search and selection of necessary data in a special database with descriptions of information sources (index) based on the information retrieval language and relevant search rules.

The main task of any IPS is to search for information relevant to the information needs of the user. It is very important that nothing is lost as a result of the search, that is, that all documents related to the request are found and nothing superfluous is returned. For this reason a qualitative characteristic of the search procedure is introduced - relevance.

Relevance is the degree to which the search results correspond to the formulated query.
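The two requirements stated above, finding every document related to the request and finding nothing superfluous, are commonly measured in information retrieval by recall and precision. These two terms do not appear in the text above; the following is a minimal sketch under that assumption, with a hypothetical function name.

```python
def precision_recall(retrieved, relevant):
    """Compare the documents a search returned with the documents that
    actually relate to the request.

    precision: share of returned documents that are relevant
               (low precision means a lot of superfluous output)
    recall:    share of relevant documents that were returned
               (low recall means relevant documents were lost)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a search returns four documents of which two are relevant, and three relevant documents exist in total, precision is 2/4 and recall is 2/3.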

Next, we will mainly consider IPS for the World Wide Web (WWW). The main characteristics of an IPS for the WWW are its spatial scale and its specialization. By spatial scale, IPS can be divided into local, regional, global and specialized. Local search engines can be designed to search quickly through the pages of a single server. Regional IPS describe the information resources of a particular region, for example, Russian-language pages on the Internet. Global search engines, unlike local ones, attempt the near-impossible: to describe the resources of the entire information space of the Internet as fully as possible.

2 History of IPS

Let us turn to the history of the emergence of the Internet, which was created in connection with the need to share information resources distributed between various computer systems. Most of the early applications, including FTP and email, were developed exclusively for exchanging data between Internet host computers.

Other applications, such as Telnet, were created so that the user can access not only information, but also the working resources of the remote system. With the development of the Internet (an increase in users and host computers), previous methods of data exchange have ceased to meet the increased needs of users. There was a need to develop new ways to search for network resources and access to them, which would allow the use of information regardless of its format and location.

To meet these needs, the Archie search engine was created first, solving the problem of locating resources on FTP servers, followed by the Gopher system, which simplifies access to various network resources. Then the World Wide Web and WAIS network information systems were developed, offering completely new methods of obtaining information. The principles on which these systems operate make it easy to navigate a huge number of information resources without needing to understand how the Internet itself works. This approach allows us to speak not merely of the resources of interconnected computer systems, but of the special information spaces of the network.

The Archie system is a set of software tools that work with special databases. These databases contain constantly updated information about files that can be accessed through the FTP service. Using the Archie system, you can search for a file by a template of its name. The user receives a list of files with an exact indication of where they are stored on the network, as well as information about the type, creation time and size of each file. The Archie information retrieval system can be accessed in a variety of ways, ranging from e-mail queries and the Telnet service to graphical Archie clients.
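Searching by a file name template can be illustrated with shell-style wildcard matching. The sketch below is not Archie's actual implementation or protocol; the listing, host names and the function find_files are hypothetical, and the matching relies on the fnmatchcase function from Python's standard library.

```python
from fnmatch import fnmatchcase

# Hypothetical database of files available over FTP (invented entries).
FILE_LISTING = [
    {"host": "ftp.example.org", "path": "/pub/docs/readme.txt", "size": 1204},
    {"host": "ftp.example.org", "path": "/pub/tools/archive.tar", "size": 90112},
    {"host": "ftp.example.net", "path": "/mirror/notes.txt", "size": 311},
]


def find_files(listing, pattern):
    """Return the entries whose file name matches a shell-style template
    such as '*.txt' or 'read*' (matching is case-sensitive)."""
    matches = []
    for entry in listing:
        file_name = entry["path"].rsplit("/", 1)[-1]
        if fnmatchcase(file_name, pattern):
            matches.append(entry)
    return matches
```

Each returned entry carries the host and path, which corresponds to the "exact indication of where they are stored on the network" mentioned above.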

The Gopher system was designed to simplify the process of locating FTP resources on the Internet and to present information about the contents of files stored on FTP servers more conveniently. The Gopher system presents information about the available files and their contents to users in a convenient form (as a menu). Gopher server menus may contain links to other Gopher and FTP servers. Thus, the user is able to travel across the Internet without paying attention to the location of the resources of interest, and to gain access to these resources.

The Veronica system is used to search for information in Gopher space by the headings of menu items. After a keyword is entered, the Veronica system finds out whether it appears in a menu on any Gopher server and, as the search result, displays a list of the headings of menu items containing the keyword. Since Veronica is not a standalone search program but is closely connected with the Gopher system, it has the same drawback as Gopher: a heading does not always make it clear what a particular information resource is. The advantage of the system is that there is no need to find out where the information is located; it is enough to select the desired entry from the list.

3 IPS structure

The basis for building the structure of the information retrieval system was its functional purpose, scope and features of the subject area described by it.

Functionally, the IPS is designed for quick and convenient search and retrieval of data from large arrays of information on stepper motors, both for internal work with data and for preparing them for various CAD systems. This imposes certain requirements on the construction of the user interface and on the form of providing information. When constructing the structure of the IPS, the potential user’s need for access to the system context-sensitive help is also taken into account.

The implementation of the above requirements is entrusted to the following series of structural components, the so-called blocks:

database integrity check;

viewing;

editing;

password protection;

search;

output of search results;

storage of search parameters;

help.

The basis for choosing just such a structure of the information retrieval system for stepper motors is very simple logic - any unit of the system must receive data, process it and give it to the user in a certain order, providing the logic of the process.

Consider each block in more detail (Fig. 1):

The unit for checking the database for integrity checks all the components of the database.

The viewing unit allows you to start working in the system by viewing the database and then select a different operating mode.

The editing unit only edits the numeric fields of the database and allows you to change the characteristics, enter new ones and delete old records in the database tables. Here you can also change the operating mode.

The password protection block restricts access to data editing: editing is allowed only after a six-digit password is entered.

The search block is designed to search by the entered terms of reference (TOR) and switch to other modes of operation.

The search results output unit displays in a certain order all the found stepper motors and their characteristics in accordance with the search specification. The search parameter storage unit records and stores information until the next search step.

The help block serves as a hint in various modes of the system.

Figure 1. IPS structure.
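The interplay of the search block and the search results output block can be sketched as a filter over database records followed by an ordering step. The record fields, the limits used as the search specification and the sort order below are assumptions chosen for illustration; the text does not list the actual characteristics stored for stepper motors.

```python
from dataclasses import dataclass


@dataclass
class StepperMotor:
    # Hypothetical characteristics; the real database fields are not given.
    model: str
    step_angle_deg: float
    holding_torque_nm: float


def search_motors(motors, max_step_angle_deg, min_holding_torque_nm):
    """Search block: keep the motors that satisfy the specification.
    Output block: return them in a fixed order (highest torque first)."""
    found = [
        m for m in motors
        if m.step_angle_deg <= max_step_angle_deg
        and m.holding_torque_nm >= min_holding_torque_nm
    ]
    return sorted(found, key=lambda m: m.holding_torque_nm, reverse=True)
```

Keeping the filter and the ordering in one small function mirrors the stated logic that each block receives data, processes it and passes it on in a certain order.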

The scope of the IPS, as mentioned above, is internal work with information and the processing of information for use in a CAD system that includes the IPS as one of its modules. This leads to very high requirements for the reliability of the system, since any CAD system is a rather complicated construction with specified reliability parameters, and each structure included in it must be at least as reliable as the system as a whole. Ensuring the necessary reliability, in turn, depends largely on the structure of the system. Organizing the IPS database requires a complete study of the subject area. In this IPS, the subject area is a wide class of stepper motors.


Internet information retrieval systems (IPS), for all their outward diversity, also fall into one of these classes. Therefore, before getting acquainted with these IPS, we consider abstract alphabetic (vocabulary), systematic (classification) and subject IPS. To do this, we define some terms from the theory of information retrieval.

Classification information retrieval systems

In the classification of IPS, a hierarchical (tree-like) organization of information is used, which is called the CLASSIFIER. Classifier sections are called RUBRICS. The library analogue of the classification IPS is a systematic catalog. The classifier is developed and improved by a team of authors. Then it is used by another team of specialists called SYSTEMATIZERS. The systematizers, knowing the classifier, read documents and assign classification indexes to them, indicating which sections of the classifier these documents correspond to.
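The tree-like classifier and the classification indexes assigned by systematizers can be modeled with nested dictionaries. The rubrics, the dotted index notation and the document names in this sketch are invented for illustration and do not correspond to any real classifier.

```python
# Hypothetical classifier: index -> (rubric name, sub-rubrics).
CLASSIFIER = {
    "1": ("Science", {
        "1.1": ("Physics", {}),
        "1.2": ("Biology", {}),
    }),
    "2": ("Arts", {
        "2.1": ("Music", {}),
    }),
}

# Indexes assigned to documents by systematizers (invented data).
CATALOG = {
    "doc-a": ["1.1"],
    "doc-b": ["2.1", "1.2"],
    "doc-c": ["2"],
}


def rubric_path(classifier, index):
    """Translate a classification index such as '2.1' into rubric names,
    walking the tree from the root: ['Arts', 'Music']."""
    parts = index.split(".")
    path, level = [], classifier
    for depth in range(1, len(parts) + 1):
        name, level = level[".".join(parts[:depth])]
        path.append(name)
    return path


def documents_under(catalog, index):
    """Return documents assigned to a rubric or to any of its sub-rubrics."""
    return sorted(
        doc for doc, indexes in catalog.items()
        if any(i == index or i.startswith(index + ".") for i in indexes)
    )
```

Looking up a top-level rubric returns documents from all of its sub-rubrics as well, which is what browsing a systematic catalog from a section heading downward amounts to.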

Subject IPS (Web rings)

From the user's point of view, a subject IPS has the simplest arrangement. You look for the name of the subject that interests you (the subject may also be something immaterial, for example, Indian music), and lists of related Internet resources are associated with that name. This would be especially convenient if the complete list of subjects were small.

Vocabulary IPS

The cultural problems associated with the use of classification IPS led to the creation of vocabulary IPS, known by the general English name search engines. The main idea of a vocabulary IPS is to build a dictionary of the words found in Internet documents, in which each word is stored together with a list of the documents in which it occurs.
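A dictionary in which each word is stored with the list of documents containing it is usually called an inverted index. The sketch below is a minimal in-memory version; the tokenizer (lowercase Latin letters and digits only) and the rule that a document must contain every query word are simplifying assumptions, not a description of any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (Latin letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because the lookup touches only the lists for the query words, the cost of a search depends on those lists rather than on the total number of documents.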

Information retrieval theory assumes two basic algorithms for the operation of a vocabulary IPS: one using keywords and one using descriptors. In the first case, only the words used in a document are taken to evaluate its content, and when a request arrives the IPS matches the words of the request against the words of the document, determining the document's relevance by the number, location and weight of the request words in it. For historical reasons, all working IPS use this algorithm in various modifications.

When working with descriptors, indexed documents are translated into some descriptor information language. Descriptor information language, like any other language, consists of the alphabet (characters), words, means of expression of paradigmatic and syntagmatic relations between words. Paradigmatics provides for the identification of lexico-semantic relations between concepts hidden in a natural language. Within the framework of paradigmatic relations, one can consider, for example, synonymy, homonymy. Syntagmatics explores relationships between words that allow them to be combined into phrases and sentences. Syntagmatics includes rules for constructing words from elements of the alphabet (coding of lexical units), rules for constructing sentences (texts) from lexical units (grammar).

That is, the user's request is translated into descriptors and processed by the IPS already in this form. This approach is more expensive in terms of computing resources, but also potentially more productive, as it allows you to abandon the relevance criterion and work directly with the pertinence of documents.

Search Results Ranking

Vocabulary IPS are capable of producing lists of documents containing millions of links. It is impossible even to look through such lists, and it is not necessary. It would be convenient to be able to set formal criteria for the (at least relative) importance of documents from the point of view of content, so that the most important documents appear at the top of the list. All IPS currently concentrate on the algorithm for ranking the links they return.

The most commonly used criteria for ranking in IPS are:

the presence of words from the query in the document, their number, their proximity to the beginning of the document, and their proximity to each other;

the presence of words from the request in the headings and subheadings of documents (headings should be specially formatted);

the number of links to this document from other documents;

the "recoverability" of referenced documents.
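Several of the criteria listed above can be combined into a single score, for instance as a weighted sum. The sketch below covers the number of query words in the text, how close the first match is to the beginning, matches in the title, and the number of incoming links; proximity of the words to each other and the last criterion are left out. The weights (1.0, 2.0, 3.0, 0.5) and all names are arbitrary assumptions, not values used by any real system.

```python
import re


def tokenize(text):
    """Split text into lowercase words (Latin letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(doc, query, inbound_links=0):
    """Weighted sum of simple ranking signals; 0.0 if no query word occurs."""
    query_words = set(tokenize(query))
    body = tokenize(doc["body"])
    title_hits = query_words & set(tokenize(doc["title"]))
    positions = [i for i, word in enumerate(body) if word in query_words]
    if not positions and not title_hits:
        return 0.0
    count_score = len(positions)                                 # number of matches
    early_score = 1.0 / (1 + positions[0]) if positions else 0.0  # near the start
    return (1.0 * count_score + 2.0 * early_score
            + 3.0 * len(title_hits) + 0.5 * inbound_links)


def rank(documents, query, link_counts):
    """Return ids of matching documents, best score first."""
    scored = [
        (score(doc, query, link_counts.get(doc_id, 0)), doc_id)
        for doc_id, doc in documents.items()
    ]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in ordered if value > 0]
```

A weighted sum keeps each criterion visible and adjustable; how the weights should be chosen is exactly the part that the text describes as the focus of current effort.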

Chapter 2. Modern IPS

1 Scopes of use of modern IPS

Modern IPS are characteristic of the so-called information industry - the newest area of the economy and the social sphere, engaged in the processing, systematization, accumulation and dissemination of information. The rapid development of IPS is associated with advances in informatics. The subject of a request to an IPS can be bibliographic data, managerial and factual information, expert assessments, retrospective experience, the results of model studies, etc. Such a wide range of tasks leads to a wide variety of types of IPS. They differ in their goals, the amount of information they contain, the types of information, and the ways of delivering it to the consumer. Along with local IPS operating within a single institution (for example, a polyclinic or hospital), there are national and international information service centers (for example, in the field of environmental protection). Bibliographic IPS are widespread (for example, those containing a bibliography for all areas of medicine and the biomedical sciences). The mass production of personal computers, the development of communications, and the possibility of combining computers into information networks and accessing, from one's own workplace, information stored in the memory of other computers have significantly expanded the range of information used and the breadth and depth of its search. A qualitatively new stage in the development of IPS is associated with the formation of databases on computer-readable media. Such databases can be accessed remotely and can serve many queries at the same time, returning search results quickly and conveniently.

Medicine and healthcare are an extremely specific area of IPS implementation. This is due to the complex structure and variety of forms of health information, which includes concepts and categories that are hard to formalize, as well as significant arrays of data to be recorded. A peculiarity of medical information is that the results of individual clinical or experimental observations, as they accumulate and are generalized, become the basis for large-scale health and social measures. Health information is the basis for managerial decisions - from choosing the most important areas of research to carrying out emergency sanitary and preventive measures. The arrays of information whose analysis underlies healthcare management include statistics (demographic and population statistics, personnel statistics, data on morbidity and mortality, etc.), generalized data on the state and achievements of medicine and a number of related scientific disciplines, and the experience of previous years. It was the complex nature of this information that led to the development of a unified concept of IPS. It provides for the phased creation of individual subsystems, which are united through the exchange of databases and (or) by means of communication tools.

As subsystems are created, they can be developed and integrated within the IPS both vertically and horizontally. Auxiliary subsystems (for example, personnel records and movement, planning and financing) can be created independently of the others. At the lower level, healthcare institutions (hospitals, clinics, research institutes) use IPS to keep medical records, monitor the effectiveness of treatment, collect and process primary statistical data, and solve managerial tasks within their level of competence (use of beds and of laboratory and diagnostic equipment, drug provision, etc.). While carrying out these operational functions, such IPS simultaneously accumulate the necessary information and then transmit it to a higher level (city, regional). Subsystems of reference and information services (in the field of bibliography and scientific research, regulatory materials, standards) are created separately. As part of the general IPS, subsystems can be developed to support and develop individual services (for example, psychiatric, oncological) or targeted programs (for example, side effects of drugs).

2 Architecture of modern IPS for WWW

Before describing the problems of building information retrieval systems for the Web and the ways of solving them, we consider a typical scheme of such a system (Fig. 2).

Figure 2. Typical scheme of an information retrieval system.

Client - in this diagram, a program for viewing a specific information resource. The most popular are multi-protocol programs such as Netscape Navigator. Such a program provides viewing of WWW documents, Gopher, WAIS, FTP archives, mailing lists and Usenet newsgroups. In turn, all these information resources are objects of search for the information retrieval system.

User interface - not just a browser; in the case of an information retrieval system, this phrase also means the way the user communicates with the search engine: the system for forming queries and viewing search results.

Search engine - serves to translate a query in the information retrieval language (IPN) into a formal query of the system, to search for links to information resources of the Web, and to deliver the results of this search to the user.

Index database - an index that is the main data set of the IPS and serves to find the address of an information resource. The architecture of the index is designed so that the search is as fast as possible and, at the same time, the value of each of the information resources found can be evaluated.

User queries - are stored in the user's personal database. Debugging each query takes a lot of time, so it is extremely important to remember the queries to which the system gives good answers.

Robot (indexing robot) - serves to scan the Internet and keep the index database up to date. This program is the main source of information about the state of the information resources of the network.

Sites - the entire Internet, or rather the information resources that can be viewed with viewing programs.
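The role of the indexing robot in this scheme, scanning the network and supplying texts for the index database, can be illustrated on a tiny in-memory model of the Web. The sketch performs no network access: the WEB dictionary, its example.org addresses and the function crawl are hypothetical stand-ins, and the traversal order (breadth-first from a start page) is an assumption.

```python
from collections import deque

# Hypothetical stand-in for the network: address -> page text and outgoing links.
WEB = {
    "http://example.org/": {
        "text": "welcome page",
        "links": ["http://example.org/a", "http://example.org/b"],
    },
    "http://example.org/a": {
        "text": "page about music",
        "links": ["http://example.org/"],
    },
    "http://example.org/b": {
        "text": "page about trains",
        "links": ["http://example.org/missing"],
    },
    "http://example.org/unlinked": {"text": "nobody links here", "links": []},
}


def crawl(web, start):
    """Follow links breadth-first from the start address and return
    {address: text} for every page reached; each page is visited once."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        url = queue.popleft()
        page = web.get(url)
        if page is None:  # broken link: nothing to index
            continue
        pages[url] = page["text"]
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The returned texts are what would be passed on to the index database; a page that no other page links to is never reached, which is one reason a robot's picture of the network is always incomplete.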

3 Popular search engines

According to LiveInternet data on the coverage of Russian-language search queries:

Multilingual: (37.2%) (0.8%)! (0.2%) and search engines owned by this company:

English and international: (Teoma mechanism)

Russian-language - most "Russian-language" search engines index and search texts in many languages: Ukrainian, Belarusian, English, Tatar, etc. They differ from the "multilingual" systems, which index all documents indiscriminately, in that they mainly index resources located in domain zones where Russian dominates, or restrict their robots to Russian-language sites in other ways.

Yandex (48.1%). Ru (5.9%)

Rambler (1.2%)

Nigma (0.3%)

Some of the search engines use external search algorithms. So, uses the Yandex search engine, and Nigma combines both its own algorithm and the combined search results from other search engines.


The search engines I reviewed are far from perfect. It is believed that an ideal search engine should meet the following requirements:

Easy to use

Clearly organized and updated index.

Quick database search and quick response.

Reliability and accuracy of search results.

The scale and number of information resources are constantly expanding. It becomes clear that the database is not perfect. Intelligent agents are a new direction underlying the new generation of search engines, which can filter information and obtain more accurate results. The Internet continues to evolve with relentless intensity, essentially erasing the restrictions on distributing and receiving information in the world. However, it is not easy to find the necessary document in this ocean of information; one should also bear in mind that new servers keep appearing in the network alongside the long-running ones.

