Search engines. Russian search engines and leading Internet search engines

A postgraduate physician can find on the Internet scientific articles for the literature review of a medical dissertation, articles in a foreign language to prepare for the candidate's minimum exam, descriptions of modern research methods and much more.

This article discusses how to search for information on the Internet using search engines.

For those who are not yet familiar with concepts such as a site or a server, here is some basic information about the Internet.

The Internet is a multitude of sites hosted on servers connected by communication channels (telephone, fiber optic and satellite lines).

A site is a collection of documents in html format (site pages) linked by hyperlinks.

A large site can consist of tens of thousands of pages (for example, "Medlink", a thematic medical catalog at http://www.medlinks.ru, consists of 30,000 pages and occupies about 400 MB of disk space on the server). A small site consists of a few dozen to a few hundred pages and takes 1-10 MB (for example, my site "Postgraduate Doctor" on July 25, 2004 consisted of 280 .htm pages and occupied 6 MB on the server).

A server is a computer connected to the Internet that works around the clock. A single server can host from several hundred to several thousand sites simultaneously.

Sites hosted on a server computer can be viewed and copied by Internet users.

To ensure uninterrupted access to the sites, the server is powered through uninterruptible power supplies, the room where the servers run (the data center) is equipped with an automatic fire-extinguishing system, and technical personnel are on duty around the clock.

Over the more than 10 years of its existence, the Runet (the Russian-language Internet) has become an orderly structure, and searching for information on the Web has become more predictable.

The main tool for finding information on the Internet is search engines.

A search engine consists of a spider program that scans Internet sites and a database (the index), which contains information about the sites it has visited.

At the webmaster's request, the spider robot visits the site, views its pages and enters information about them into the search engine's index. The search engine can also find a site on its own, even if its webmaster never applied for registration: if a link to the site appears anywhere in the search engine's path (on another site, for example), the site will be indexed.

The spider does not copy the site’s pages to the search engine index, but saves information about the structure of each site’s page — for example, which words appear in the document and in what order, the address of the site’s hyperlinks, the size of the document in kilobytes, the date it was created, and much more. Therefore, the search engine index is several times smaller than the amount of indexed information.
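
As a rough illustration of the kind of record described above, here is a minimal sketch in Python; all field names and values are invented for the example and do not come from any real search engine.

    # A sketch of the kind of compact record a search engine might keep about
    # one indexed page; the field names and values are illustrative assumptions.
    page_record = {
        "url": "http://www.medlinks.ru/article.php?id=1",   # hypothetical address
        "size_kb": 42,                   # document size in kilobytes
        "indexed_at": "2004-07-25",      # date the page was crawled
        "words": ["dissertation", "thesis", "review"],   # words in order of appearance
        "outgoing_links": ["http://www.medlinks.ru/"],   # hyperlink targets
    }

    # The record is far smaller than the page itself, which is why the whole
    # index occupies only a fraction of the volume of the indexed content.
    print(len(str(page_record)), "bytes of metadata instead of the full page")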

What and how does a search engine search on the Internet?

People came up with search engines to help them find information. What is information in our human understanding and visual representation? Not smells or sounds, not sensations and not images. It is just words, text. When we search for something on the Internet, we ask in words - a search query - and in response we hope to get back text containing exactly those words, because we know the search engine will look in its information array for exactly the words we requested. That is what it was designed to do: search for words.

A search engine does not search for words on the Internet, but in its index. The search engine index contains information on only a small number of Internet sites. There are search engines that index only sites in English, and there are search engines that enter only Russian-language sites in their index.

International search engines (the index contains sites in English, German and other European languages)

Runet Search Engines  (the index contains sites in Russian)

Features of some Runet search engines

The Google search engine does not take into account the morphology of the Russian language. For example, Google treats different grammatical forms of the same word (such as "диссертация" and "диссертации") as different words.

You need to view not only the first page of the search results but the following ones as well, because the sites containing the information the user really needs are often on pages 4-10 of the results.

Why does this happen? Firstly, many site creators do not optimize their pages for search engines; for example, they do not include meta tags in them.

Meta tags are service elements of a web document that are not visible on the screen but are important when search engines look for your site. Meta tags make the search engine's work easier: it does not have to go deep into the document and analyze its entire text to form a definite picture of it. The most important meta tag is meta name="keywords" - the keywords of the site page. If a word from the main text of the document is not regarded as "search spam" and is among the first 50 words in "keywords", the weight of that word in the query increases, that is, the document receives higher relevance.
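
To make this concrete, here is a minimal sketch of how a robot might read the keywords meta tag, using only Python's standard library; the sample HTML and the class name are invented for the example.

    from html.parser import HTMLParser

    class KeywordsExtractor(HTMLParser):
        """Collects the content of <meta name="keywords" ...> tags."""
        def __init__(self):
            super().__init__()
            self.keywords = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
                content = attrs.get("content") or ""
                self.keywords += [w.strip() for w in content.split(",") if w.strip()]

    html = '<html><head><meta name="keywords" content="dissertation, thesis, medicine"></head></html>'
    parser = KeywordsExtractor()
    parser.feed(html)
    print(parser.keywords)   # ['dissertation', 'thesis', 'medicine']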

Secondly, there is fierce competition among webmasters for the first positions in search results.

According to statistics, 80% of visitors to the site come from search engines. Sooner or later, webmasters will realize this and begin to adapt their sites to the laws of search engines.

Unfortunately, some site creators use a dishonest method of promoting their sites through search engines - so-called "search spam". To create an apparent match between the meta tags and the rest of the site's text, they place hidden words on the pages in the same color as the background, so that they do not disturb visitors. However, search engine developers track such tricks, and the "search spammer's" site falls from the heights it has reached to the very bottom.

Metaphors and figurative comparisons are of little use on the Internet. They distort the truth and lead Internet users away from accurate and unambiguous information. The less artistry and the more accuracy in a site author's style, the higher the position the site occupies in search results.

In turn, if you want the search engine to find articles on the Internet for you, think like a machine, become a machine. At least for a while. At the time of the search.

1. DuckDuckGo

What is it

DuckDuckGo is a fairly well-known open source search engine. Servers are located in the USA. In addition to its own robot, the search engine uses the results of other sources: Yahoo, Bing, Wikipedia.

What makes it better

DuckDuckGo positions itself as a search that provides maximum privacy and confidentiality. The system does not collect any data about the user, does not store logs (there is no search history), the use of cookies is as limited as possible.

DuckDuckGo does not collect or share user information. This is our privacy policy.

Gabriel Weinberg, founder of DuckDuckGo

Why do you need it

All major search engines try to personalize results based on data about the person in front of the monitor. This phenomenon is called the "filter bubble": the user sees only those results that are consistent with his preferences, or that the system believes to be consistent with them.

DuckDuckGo creates an objective picture that does not depend on your past behavior on the Web, and eliminates the thematic advertising that Google and Yandex target based on your queries. With DuckDuckGo it is also easy to search for information in foreign languages: Google and Yandex by default prefer Russian-language sites, even if the query is entered in another language.


2. not Evil

What is it

not Evil is a system that searches the anonymous Tor network. To use it, you need to enter this network, for example by launching the specialized browser of the same name.

not Evil is not the only search engine of its kind. There is LOOK (the default search in the Tor browser, accessible from the regular Internet), TORCH (one of the oldest search engines on the Tor network) and others. We settled on not Evil because of its explicit nod to Google (just look at its start page).

What makes it better

It searches where Google, Yandex, and other search engines are barred from entering.

Why do you need it

The Tor network has many resources that cannot be found on the law-abiding Internet, and their number will grow as the authorities tighten control over the content of the Web. Tor is a kind of network within the Network, with its own social networks, torrent trackers, media, marketplaces, blogs, libraries and so on.

3. YaCy

What is it

YaCy is a decentralized search engine based on the principle of P2P networks. Each computer on which the main software module is installed scans the Internet independently, that is, it is an analogue of a search robot. The results are collected in a common database that all YaCy members use.

What makes it better

It is hard to say whether this is better or worse, since YaCy is a completely different approach to organizing search. The absence of a single server and an owning company makes the results completely independent of anyone's preferences. The autonomy of each node eliminates censorship. YaCy is able to search the deep web and non-indexed public networks.

Why do you need it

If you are a supporter of open source software and free Internet, not subject to the influence of government agencies and large corporations, then YaCy is your choice. It can also be used to organize searches within a corporate or other autonomous network. And while YaCy is not very useful in everyday life, it is a worthy alternative to Google in terms of the search process.

4. Pipl

What is it

Pipl is a system designed to search for information about a specific person.

What makes it better

The authors of Pipl argue that their specialized algorithms are more efficient than "regular" search engines. In particular, the priority sources of information are social network profiles, comments, lists of participants and various databases where information about people is published, such as court decisions. Pipl's leadership in this area has been confirmed by Lifehacker.com, TechCrunch, and other publications.

Why do you need it

If you need to find information about a person living in the USA, Pipl will be much more effective than Google. Russian court databases, apparently, are not available to the search engine, so it does not handle Russian citizens nearly as well.

5. FindSounds

What is it

FindSounds is another specialized search engine. Searches for various sounds (home, nature, cars, people and so on) in open sources. The service does not support requests in Russian, but there is an impressive list of Russian-language tags by which you can search.

What makes it better

The results contain only sounds and nothing else. In the search settings, you can set the desired format and sound quality. All found sounds are available for download. There is also search by sound sample.

Why do you need it

If you need to quickly find the sound of a musket shot, the drumming of a woodpecker, or Homer Simpson's scream, then this service is for you. And those are only examples from the available Russian-language tags; in English the range is even wider.

But seriously, a specialized service involves a specialized audience. But what if it comes in handy for you?

6. Wolfram|Alpha

What is it

Wolfram|Alpha is a search engine of a different kind. Instead of links to articles that contain keywords, it gives a ready-made answer to the user's request. For example, if you enter "compare the population of New York and San Francisco" into the search form in English, Wolfram|Alpha immediately displays tables and graphs with the comparison.

What makes it better

This service is better than others for finding facts and calculating data. Wolfram|Alpha accumulates and organizes the knowledge available on the Web from various fields, including science, culture and entertainment. If this database contains a ready-made answer to a search query, the system displays it; if not, it calculates and displays the result. In this case, the user sees only the necessary information and nothing more.

Why do you need it

If you, for example, are a student, analyst, journalist or researcher, you can use Wolfram|Alpha to search for and calculate data related to your activity. The service does not understand all requests, but it is constantly evolving and becoming smarter.

7. Dogpile

What is it

The Dogpile metasearch engine displays a combined list of results from search results of Google, Yahoo and other popular systems.

What makes it better

Firstly, Dogpile displays less advertising. Secondly, the service uses a special algorithm to find and show the best results from different search engines. According to Dogpile's developers, their system produces the most complete results available on the entire Internet.

Why do you need it

If you can’t find the information on Google or another standard search engine, look for it in several search engines at once using Dogpile.

8. BoardReader

What is it

BoardReader is a text search system for forums, Q&A services and other communities.

What makes it better

The service allows you to narrow the search field to social sites. Thanks to special filters, you can quickly find posts and user comments that match your criteria: language, publication date and site name.

Why do you need it

BoardReader can be useful for PR specialists and other media professionals who are interested in the opinion of the mass audience on certain issues.

Finally

The life of alternative search engines is often fleeting. Lifehacker asked Sergey Petrenko, former general director of the Ukrainian branch of Yandex, about the long-term prospects of such projects.


Sergey Petrenko

Former CEO of Yandex.Ukraine.

As for the fate of alternative search engines, it is simple: to remain very niche projects with a small audience, either without clear commercial prospects or, conversely, with complete clarity about their absence.

If you look at the examples in the article, you can see that such search engines either specialize in a narrow but in-demand niche, which perhaps has not yet grown enough to show up on Google's or Yandex's radar, or they test an original ranking hypothesis that is not yet applicable in mainstream search.

For example, if search within Tor suddenly becomes in demand - that is, at least a percent of Google's audience needs those results - then, of course, ordinary search engines will begin to solve the problem of how to find them and show them to the user. If user behavior shows that for a noticeable number of queries a significant share of users find results that ignore user-dependent factors more relevant, then Yandex or Google will start producing such results.

To be better in the context of this article does not mean to be better in everything. Yes, in many aspects, our heroes are far from Google and Yandex (even far from Bing). But each of these services gives the user something that giants in the search industry cannot offer. Surely you also know similar projects. Share with us - we will discuss.

Search Engines

Search engines allow you to find WWW-documents related to specific topics or equipped with keywords or their combinations. There are two ways to search on the search engines:

· According to the hierarchy of concepts;

· By keywords.

Search engines are populated automatically or manually. The search server usually has links to other search engines, and sends them a search request at the request of the user.

There are two types of search engines.

1. Full-text search engines that index every word on a web page, excluding stop words.

2. "Abstract" search engines that create an abstract of each page.

For webmasters, full-text machines are more useful, since any word that appears on a web page is analyzed to determine its relevance to user requests. However, abstract machines can index pages better than full-text ones. It depends on the algorithm for extracting information, for example, by the frequency of use of the same words.

The main characteristics of search engines.

1. The size of a search engine is determined by the number of indexed pages. However, at any given moment, the links returned in response to user queries may be of varying age. The reasons why this happens:

· Some search engines immediately index the page at the request of the user, and then continue to index pages that have not yet been indexed.

· Others often index the most popular web pages.

2. Date of indexation. Some search engines show the date the document was indexed. This helps the user determine when a document appeared on the network.

3. The indexing depth shows how many levels of pages, starting from the specified one, the search engine will index. Most engines have no restrictions on indexing depth. Reasons why not all pages may be indexed:

· Improper use of frame structures.

· Use of a site map without duplicating it with regular links

4. Work with frames. If the search robot does not know how to work with frame structures, then many structures with frames will be missed when indexing.

5. The frequency of links. Major search engines can determine a document’s popularity by how often it is referenced. Some machines based on such data “conclude” whether or not to index the document.

6. Frequency of server updates. If the server is updated frequently, the search engine will reindex it more often.

7. Indexing control. Shows what means you can use to control how the search engine indexes your site.

8. Redirection. Some sites redirect visitors from one server to another, and this parameter shows how redirection is handled for the documents found.

9. Stop words. Some search engines do not include certain words in their indexes or may exclude these words from user queries. Such words are usually prepositions or other very frequently used words.

10. Spam penalties. The ability to penalize or block spam.

11. Removal of old data. A parameter that determines what actions the webmaster should take when shutting down a server or moving it to another address.

Examples of search engines.

1. AltaVista. The system was opened in December 1995 and is owned by DEC. Since 1996 it has collaborated with Yahoo. AltaVista is the best option for customized search. However, results are not sorted by category, and you have to look through the provided information manually. AltaVista does not provide means to retrieve lists of active sites, news, or other content search capabilities.

2. Excite Search. Launched at the end of 1995; in September 1996 it acquired WebCrawler. This node has a powerful search mechanism, the ability to automatically customize the information provided, and descriptions of numerous nodes compiled by qualified staff. Excite differs from other search sites in that it allows you to search news services and publishes reviews of Web pages. The search engine uses standard keyword search tools and heuristic content search methods. Thanks to this combination, you can find meaningful Web pages even if they do not contain the user-specified keywords. A disadvantage of Excite is its somewhat chaotic interface.

3.HotBot. Launched in May 1996. Owned by Wired. Based on Berkeley Inktomi search engine technology. HotBot is a database containing documents indexed by full text and one of the most comprehensive search engines on the Web. Its means of searching by logical conditions and means of restricting the search to any area or Web site help the user find the necessary information, filtering out unnecessary. HotBot provides the ability to select the required search options from the drop-down lists.

4. InfoSeek. Launched before 1995 and easily accessible. Currently contains about 50 million URLs. InfoSeek has a well-designed interface and excellent search tools. Most answers to queries are accompanied by "related topics" links, and each answer is followed by "similar pages" links. The search engine's database of pages is indexed in full text. Answers are sorted by two indicators: the frequency with which the words or phrases occur on a page, and the position of the words or phrases on the page. There is a Web Directory subdivided into 12 categories with hundreds of subcategories that can be searched. Each catalog page contains a list of recommended nodes.

5. Lycos. It has been operating since May 1994 and is widely known and used. The structure includes a catalog with a huge number of URLs and the Point search engine, which uses technology for statistical analysis of page content, as opposed to full-text indexing. Lycos contains news, site reviews, links to popular sites, city maps, and tools for finding addresses, images, and sound and video clips. Lycos ranks responses by their degree of conformity to the query according to several criteria, for example, the number of search terms found in the annotation to the document, the interval between words in a specific phrase of the document, and the position of the terms in the document.

6. WebCrawler. Opened on April 20, 1994 as a project of the University of Washington. WebCrawler provides a syntax for refining queries, as well as a large selection of node annotations with a simple interface.

Next to each answer, WebCrawler displays a small icon with a rough estimate of how well the answer matches the query. It also displays a page with a brief summary for each answer, its full URL and an exact relevance estimate, and allows the answer itself to be used as a model for a new query, using its keywords. WebCrawler has no graphical interface for customizing queries. The use of wildcard characters is not allowed, and it is impossible to assign weights to keywords. There is also no way to limit the search to a specific area.

7. Yahoo. The oldest directory, Yahoo, was launched in early 1994. It is widely known, frequently used and highly respected. In March 1996 the Yahooligans directory for children was launched, and regional and top-level Yahoo directories have appeared. Yahoo is based on user submissions. It can serve as a starting point for any Web search, because its classification system helps the user find a site with well-organized information. Web content is divided into 14 general categories listed on the Yahoo! home page. Depending on the specifics of the request, the user can either work with these categories, drilling down into subcategories and lists of nodes, or search for specific words and terms across the entire database. The user can also limit the search to any section or subsection of Yahoo!. Because the classification of nodes is performed by people rather than by a computer, the quality of links is usually very high. However, refining a search after a failure is a difficult task. The Yahoo! search engine includes AltaVista, so if a search on Yahoo! fails, the query is automatically repeated on AltaVista and the results are then passed back to Yahoo!. Yahoo! also provides the ability to send queries to Usenet and Four11 to find e-mail addresses.

Russian search engines include:

1. Rambler. This is a Russian-language search engine. The sections listed on the Rambler home page cover Russian-language Web resources. There is a classifier of information. A convenient feature is the list of the most visited sites provided for each of the proposed topics.

2. Aport Search. Aport is one of the leading search engines, certified by Microsoft as a local search system for the Russian version of Microsoft Internet Explorer. One of Aport's advantages is the English-Russian and Russian-English translation of queries and search results, so you can search Russian Internet resources without even knowing Russian. Moreover, you can search for information using expressions, even whole sentences. Among the main properties of the Aport search engine, the following can be highlighted:

Translation of the query and search results from Russian into English and vice versa;

Automatic checking of the query for spelling errors;

Informative output of search results for found sites;

Ability to search in any grammatical form;


An advanced query language for professional users.

Other search features include support for the five main code pages (of different operating systems) for the Russian language; search technology with restrictions on URL and document date; searching by headings, comments and captions to images, etc.; saving the search parameters and a specified number of the user's previous queries; and merging copies of a document located on different servers.

3. List.ru (http://www.list.ru). In its implementation, this server has much in common with the English-language system Yahoo!. The main page of the server contains links to the most popular search categories.


The list of links to the main categories of the catalog occupies the central place. The search in the catalog is implemented in such a way that a query can return both individual sites and categories. In case of a successful search, the URL, name, description and keywords are displayed. The Yandex query language may be used. The "Catalog structure" link opens the full catalog rubricator in a separate window, and it is possible to move from the rubricator to any selected subcategory. A more detailed thematic division of the current section is represented by a list of links. The catalog is organized so that all sites contained in the lower levels of the structure are also represented in the higher-level rubrics. The displayed list of resources is sorted alphabetically, but you can choose other sorting: by time of last change, by order of addition to the catalog, or by popularity among catalog visitors.

4. Yandex. Yandex series software products are a set of full-text indexing and text data search tools, taking into account the morphology of the Russian language. Yandex includes modules for morphological analysis and synthesis, indexing and searching, as well as a set of auxiliary modules, such as a document analyzer, markup languages, format converters, and spider.

Morphological analysis and synthesis algorithms based on the basic dictionary can normalize words, that is, find their initial form, as well as build hypotheses for words not contained in the basic dictionary. The full-text indexing system allows you to create a compact index and quickly search using logical operators.
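
Purely as an illustration of this kind of normalization (finding the initial form of a Russian word), here is a sketch using the third-party pymorphy2 library. pymorphy2 is not part of the Yandex products described above and is not mentioned in the article; the word list is arbitrary.

    # pip install pymorphy2  (a third-party morphological analyzer, used here
    # only to illustrate normalizing Russian word forms to a dictionary form)
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()
    for word in ["диссертация", "диссертации", "диссертацию"]:
        normal = morph.parse(word)[0].normal_form   # initial (dictionary) form
        print(word, "->", normal)                   # every form maps to "диссертация"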

Yandex is designed to work with texts on the local and global networks, and can also be connected as a module to other systems.

A search engine, or simply a "searcher", is a system that finds web pages matching a user's query. The most famous search engine in the world is Google, the most popular in Russia is Yandex, and Yahoo is one of the oldest search engines. In the architecture of a search engine, one can distinguish the search engine proper, the core of the system, represented by a set of software modules; the database, or index, which stores information about all Internet resources known to the search engine; and a set of sites that serve as users' entry points to the system (www.google.com, www.yandex.ru, ru.yahoo.com, etc.). All this corresponds to the classical three-tier architecture of information systems: there is a user interface, business logic (in this case represented by the implementation of the search algorithms) and a database.

Internet Search Specifics

At first glance, Internet search is not much different from conventional information retrieval, for example, from a query to a database or from a file-search task. At least that is what the developers of the first Internet search engines thought, but over time they realized they were mistaken...

The first difference between Internet search and conventional search is that a search algorithm over a database assumes that the database structure is known in advance to both the search engine and the query author. On the Internet, for obvious reasons, this is not so: web pages form not a directory structure but a network, which also affects search algorithms, and the format of the data posted on Internet resources is not controlled by anyone.

The second difference, partly a consequence of the first, is that the query is presented not as a set of parameter values (search criteria) but as text written by a person in a language natural to him. Thus, before starting the search, you still need to understand exactly what the query author wants - and it is not another person but a computer that has to understand this.

The third difference is less obvious but no less fundamental: in a catalog or database, all elements are equal. On the Internet there is competition and, consequently, a division into more "reliable suppliers of information" and sources closer in status to "information garbage". This is how people classify resources, and search engines treat them in much the same way.

And in conclusion, it should be added that the search area is billions of pages, several kilobytes or more each. About ten million pages are added daily and as many updated. All this is presented in various digital formats. Unfortunately, even modern technologies and resources available to the leaders of the Internet search services market do not allow them to process all this diversity on the fly and in full.

What a search engine consists of

First of all, it is important to realize one more and probably the most significant difference between the operation of an Internet search engine and the operation of any other information system that searches in various directories and databases. The Internet search engine does not search among what is on the Internet at the moment of the request; it tries to generate a response based on its own information store, a database called the index, where it keeps a dossier on every page known to it and periodically updates those dossiers. In other words, the search engine works not with the original but with a projection of the valid search space. Any recent change on the Internet can be reflected in the search results only after the relevant pages have been indexed, that is, added to the search engine index. So, to a first approximation, a search engine consists of a search engine core, a database or index, and entry points into the system.

Now briefly about what a search engine consists of:

  • Spider. An application that downloads the pages of Internet resources. The spider does not really "crawl" anywhere: it simply requests the contents of pages the same way a regular Internet browser does, by sending a request to the HTTP server and receiving its response. After the contents of a page have been downloaded, they are sent to the indexer and the crawler, which are described below.

  • Indexer. The indexer performs an initial analysis of the contents of the downloaded page, selects its main parts (page title, description, links, headings, etc.) and places them into the sections of the search database, that is, into the search engine index. This process is called indexing of Internet resources, hence the name of the subsystem. Based on the results of this initial analysis, the indexer can also decide that the page is "unworthy" of being in the index at all. The reasons for such a decision vary: the page has no title, is an exact copy of another page already in the index, or contains links to resources prohibited by law.

  • Crawler. This "animal" is intended to "crawl" along the links found on the page downloaded by the spider. The crawler analyzes the paths leading from the current page to other sections of the site or to pages of external Internet resources and determines the order in which the spider should traverse the threads of the World Wide Web. It is the crawler that finds pages new to the search engine and passes them to the spider. Crawler operation is based on breadth-first and depth-first graph search algorithms.

  • Subsystem of processing and delivery of results (Search Engine and Results Engine). The most important part of any search engine. The developers keep the algorithms of this subsystem in strict secrecy, since they are a trade secret. It is this part of the search engine that is responsible for the adequacy of the engine's response to the user's request. Two main components can be distinguished here:
    • Ranking subsystem. Ranking is the ordering of Internet site pages in accordance with their relevance to a particular query. Page relevance, in turn, is the degree to which the content of a page corresponds to the meaning of the query, and the search engine determines this value on its own, based on a huge number of parameters. Ranking is the most mysterious and controversial part of the search engine's "artificial intelligence". Besides a page's structure and content, its ranking is also affected by the number and quality of links leading to it from other sites, the age of the site's domain, the behavior of users viewing the page, and many other factors.

    • Results delivery subsystem. The tasks of this subsystem include interpreting the user's query, translating it into the language of structured queries to the index, and forming the search results pages. In addition to parsing the query text itself, the search engine may also take into account:
      • Request context, formed from the meaning of the user's previous queries. For example, if a user frequently visits sites on automotive topics, then for a query with the word "Volga" or "Oka" he probably wants information about the cars of these brands, and not about where the Russian rivers of the same name begin. This is called personalized search: the results for the same query differ significantly for different users.

      • User preferences, which the search engine can "guess" by analyzing the links the user selects on results pages. This is another way of adjusting the context of the query: with his actions the user, as it were, tells the machine what exactly he wanted to find. As a rule, search engines try to add to the results pages that are relevant to the query but belong to quite different areas of life. Suppose a user is interested in cinema and therefore often selects links to pages announcing new films, even if those pages are not entirely relevant to the original query. When forming the answer to his next query, the system may then give preference to pages describing films whose titles contain words from the query text.

      • Region, which is very important when processing commercial queries related to purchasing goods and services from local suppliers. If you are interested in sales and discounts and are located in Moscow, you are most likely not at all interested in which promotions on this topic are being held in St. Petersburg, unless you explicitly said so in the query text; information about sales in Moscow should appear first in the results. Thus, modern search engines divide queries into geo-dependent and geo-independent. Most likely, if the search engine decides that your query is geo-dependent, it automatically adds a region attribute to it, which it tries to determine from information about your Internet provider.

      • Time. Search engines sometimes have to analyze when the events described on a page take place. After all, information constantly becomes outdated, while the user primarily needs links to the latest news, current forecasts and announcements of events that have not yet finished or are still to come. Understanding that a page's relevance depends on time, and comparing that time with the moment the query is executed, also requires a fair amount of intelligence from the search engine.

        Next, the search engine finds the closest key query in the index and generates the results, sorting the links in descending order of relevance. Each key query in the index corresponds to a separate ranking of the pages relevant to it. The system does not start a new key query for every combination of letters and numbers; it does so based on an analysis of the frequency of particular user queries. The search engine can also mix the rankings of several key queries in the results if it decides the user needs this. (A toy sketch of this matching and sorting is given right after this list.)
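
The sketch below is a deliberately simplified illustration of matching a query against an index and sorting by relevance, with a crude region boost of the kind mentioned above. The index, the weights and the boost factor are invented numbers; real ranking formulas are trade secrets, as noted earlier.

    # Toy ranking: sum invented per-page weights for each query word, apply a
    # crude geo-dependent boost, and sort in descending order of relevance.
    index = {
        "sale":     {"moscow-shop.example": 0.8, "spb-shop.example": 0.7},
        "discount": {"moscow-shop.example": 0.6, "blog.example": 0.2},
    }

    def search(query, user_region=None):
        scores = {}
        for word in query.lower().split():
            for page, weight in index.get(word, {}).items():
                scores[page] = scores.get(page, 0.0) + weight
        if user_region == "moscow":                  # invented region boost
            for page in scores:
                if page.startswith("moscow"):
                    scores[page] *= 1.5
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    print(search("sale discount", user_region="moscow"))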

General principles of the search engine

You need to understand that Internet search services are a very, very profitable business. We need not go into details about what companies like Google and Yandex live on: the bulk of their profit is contextual advertising revenue. And since Internet search is extremely profitable, competition among such companies is very serious. What determines competitiveness in the Internet search market? The answer is the quality of the search results. It is logical that the higher it is, the more new users appear in the system, and the more valuable the contextual advertising placed on the results pages becomes. Search engine developers spend a lot of effort "cleaning" their results of all kinds of information garbage, popularly called spam. More details about how this is done will be given in a separate article; here I will state the general principles of search engine behavior, formulated as conclusions from all of the above.

  1. A search engine, in the form of its spiders and crawlers, constantly scans the Internet for new and updated pages, since out-of-date information is valued lower.

  2. The search engine periodically updates the ranking of resources according to their relevance to key queries, as new pages constantly appear in the index. This process is called search result update.

  3. Due to the huge amount of information posted on the World Wide Web and the limited resources of the search engine itself, the engine always tries to download only what it considers most necessary. Its arsenal includes all kinds of filters that cut off much of the unnecessary material already at the indexing stage or throw spam out of the index when search results are updated.

  4. In the course of analyzing a request, modern search engines try to take into account not only the text of the request itself, but also its surroundings: the context and user preferences that were mentioned earlier, as well as the time of the request, region, and much more.

  5. The relevance of a particular page is affected not only by its internal parameters (structure, content), but also by external parameters, such as links to a page from other sites and user behavior when viewing it.

The work of search engines is constantly being improved. The ideal operation of a search engine (for a person) is possible only if all decisions regarding indexing and ranking will be made by a commission consisting of a large number of specialists in all areas and areas of human activity. Since this is unrealistic, such a commission is replaced by expert systems, heuristic search algorithms, and other elements of artificial intelligence. It is likely that the work of all these subsystems could also give more adequate results if it were possible to process absolutely all the data that is publicly available on the Internet, but this is also almost impossible. Imperfect artificial intelligence and limited resources are the two main reasons that the results of search results do not always please users, but all this is cured by time. Today, in my opinion, the work of the most famous and major search engines is consistent with the needs and expectations of their users.

How do search engines work? One of the great features of the Internet is that there are hundreds of millions of web resources waiting and ready to be presented to us. The bad thing is that there are just as many millions of pages that, even if we need them, will never appear before us, simply because they are unknown to us. How can we find out what can be found on the Internet, and where? Usually we turn to search engines for this.

Internet search engines are special sites on the global network that are designed to help people find the information they need on the World Wide Web. There are differences in the ways in which search engines perform their functions, but in general there are 3 main and identical functions:

They all “search” the Internet (or some Internet sector) - based on the given keywords;
  - All search engines index the words they are looking for and the places where they find them;
  - All search engines allow users to search for words or combinations of keywords based on web pages already indexed and entered into their databases.

The very first search engines indexed up to several hundred thousand pages and received 1,000-2,000 queries per day. Today, top search engines continuously index hundreds of millions of pages and process tens of millions of queries per day. Below we will talk about how search engines work and how they "stack" all the pieces of information they find so that they can answer any question that interests us.

Take a look at the web

When people talk about Internet search engines, they really mean search engines for the World Wide Web. Before the Web became the most visible part of the Internet, search engines already existed to help people find information on the network. Programs called Gopher and Archie were able to index files located on different servers connected to the Internet and dramatically reduced the time spent finding the right programs or documents. At the end of the 1980s, "Internet skills" was synonymous with the ability to use Gopher, Archie, Veronica and similar search programs. Today, most Internet users limit their search to the World Wide Web, or WWW.

Small start

Before a search engine can tell you where to find a document or file, that file or document must already have been found. To find information on the hundreds of millions of existing web pages, a search engine uses a special robot program. This program is also called a spider, and it serves to build a list of the words found on a page. The process of building such a list is called web crawling. To build and record a "useful" (meaningful) list of words, the search spider must "browse" a huge number of other pages.
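
Below is a very small sketch of this process, using only Python's standard library. The starting address, the page limit and the class name are illustrative assumptions, not any real search engine's code; a real spider is far more careful about politeness rules, robots.txt and error handling.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkAndTextParser(HTMLParser):
        """Collects link targets and visible words from one HTML page."""
        def __init__(self):
            super().__init__()
            self.links, self.words = [], []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

        def handle_data(self, data):
            self.words.extend(data.split())

    def crawl(start_url, max_pages=3):
        seen, queue, word_lists = set(), deque([start_url]), {}
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except (OSError, ValueError):
                continue                       # skip unreachable or malformed URLs
            parser = LinkAndTextParser()
            parser.feed(html)
            word_lists[url] = parser.words     # the per-page word list described above
            queue.extend(urljoin(url, link) for link in parser.links)
        return word_lists

    print(list(crawl("http://example.com").keys()))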

How does a spider start its journey across the net? Usually the starting points are the world's largest servers and very popular web pages. The spider begins its journey from such a site, indexes all the words it finds and continues following links to other sites. In this way, the spider robot begins to cover all the large "pieces" of web space. Google.com began as an academic search engine. In an article describing how this search engine was created, Sergey Brin and Lawrence Page (the founders and owners of Google) gave an example of how fast Google spiders work. There are several of them, and usually the search begins with 3 spiders. Each spider maintains up to 300 simultaneously open connections to web pages. At peak load, using 4 spiders, the Google system is able to process 100 pages per second, generating traffic of about 600 KB/s.

To provide the spiders with the data they need, early Google had a server whose only job was to feed the spiders more and more URLs. To avoid depending on Internet service providers for the domain name servers (DNS) that translate URLs into IP addresses, Google acquired its own DNS server, minimizing the time spent on page indexing.

When Googlebot visits an HTML page, it takes into account 2 things:

Words (text) on the page;
  - their location (in which part of the body of the page).

Words located in service sections such as the title, subtitles, meta tags and others are flagged as especially important for user search queries. The Google spider was built to index every significant word on a page, excluding articles such as "a", "an" and "the". Other search engines take a slightly different approach to indexing.

All approaches and algorithms of search engines are ultimately aimed at making spider robots work faster and more efficiently. For example, some search robots track, when indexing, the words in the title, the links, and the 100 most frequently used words on the page, as well as every word in the first 20 lines of the page's text content. This, in particular, is the indexing algorithm used by Lycos.

Other search engines, such as AltaVista, go in a different direction, indexing every single word of the page, including "a," "an," "the" and other unimportant words.

Meta Tags

Meta tags allow the owner of a web page to specify keywords and concepts that define the essence of its content. This is a very useful tool, especially when those keywords may be repeated only two or three times in the page text. In this case, the meta tags can "direct" the search robot to the right choice of keywords for indexing the page. There is, however, the possibility of "stuffing" the meta tags with popular search queries and concepts that have nothing to do with the content of the page itself. Search robots are able to deal with this, for example, by analyzing the correlation between the meta tags and the page content and "throwing out" those meta tags (and the corresponding keywords) that do not match the content of the pages.

All this applies to cases where the owner of a web resource really wants to be included in the search results for the desired search words. But it often happens that the owner does not want the resource to be indexed by robots at all. Such cases, however, are beyond the topic of our article.

Index building

As soon as the spiders have finished finding new web pages, the search engine must store all the information found so that it can be conveniently used later. Two key components matter here:

Information stored with data;
  - The method by which this information is indexed.

In the simplest case, the search engine could simply store the word and the URL where it was found. But this would make the search engine a completely primitive tool, because there would be no information about which part of the document the word occurs in (meta tags or plain text), whether it is used once or repeatedly, and whether it appears in a link to another important, related resource. In other words, this method would not allow ranking sites and would not provide users with relevant results.

To provide us with useful data, search engines store more than just a word and its URL. A search engine can record the number of occurrences (frequency) of a word on a page and assign the word a "weight", which then helps produce search listings ranked by that weight, taking the word's location into account (links, meta tags, page title, etc.). Each commercial search engine has its own formula for calculating keyword "weight" during indexing. This is one of the reasons why search engines return completely different results for the same query.

The next important step in processing the found information is encoding it to reduce the disk space needed to store it. For example, the original Google paper describes using 2 bytes (of 8 bits each) to store word weight data; this takes into account whether the word is capitalized, its font size, and other information that helps rank the page. Each such "piece" of information requires 2-3 bits within the complete 2-byte set. As a result, a huge amount of information can be stored in a very compact form. After the information is "compressed", it is time to start indexing.
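
As a rough illustration of this kind of compact encoding, here is a sketch that packs a few attributes of one word occurrence into two bytes. The bit layout (1 bit for capitalization, 3 bits for font size, 12 bits for position) is invented for the example and is not Google's actual format.

    # Pack a few ranking attributes of one word occurrence into 2 bytes.
    # The bit layout below is an invented example, not a real engine's format.
    def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
        assert 0 <= font_size < 8 and 0 <= position < 4096
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack_hit(hit: int):
        return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

    hit = pack_hit(capitalized=True, font_size=3, position=157)
    print(hit.to_bytes(2, "big"), unpack_hit(hit))   # two bytes per occurrence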

The purpose of indexing is simple: to provide the fastest possible search for the necessary information. There are several ways to build indexes, but the most effective is to build hash tables. In hashing, a formula is used to assign each word a numerical value.

In any language there are letters with which far more words begin than with the rest of the alphabet. For example, the "M" section of an English dictionary is much larger than the "X" section. This means that, other things being equal, finding a word starting with the most popular letter would take longer than finding any other word. Hashing evens out this difference and reduces the average search time, and it also separates the index itself from the real data. The hash table contains hash values along with pointers to the data corresponding to each value. Effective indexing plus efficient placement together provide high search speed, even when the user asks a very complex query.
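
A minimal sketch of the idea follows; the hashing formula and the bucket count are arbitrary choices for illustration (Python's built-in dict already works this way internally).

    # Map each word to a fixed bucket via a hash value, so that lookup time
    # does not depend on which letter the word starts with.
    BUCKETS = 8                       # arbitrary small number for illustration

    def bucket_of(word: str) -> int:
        value = 0
        for ch in word:
            value = (value * 31 + ord(ch)) % 2**32   # a simple rolling hash
        return value % BUCKETS

    table = [[] for _ in range(BUCKETS)]
    for word in ["medicine", "xylophone", "morphology", "thesis"]:
        table[bucket_of(word)].append((word, "pointer-to-postings"))

    # Lookup goes straight to one bucket instead of scanning the whole index.
    print(table[bucket_of("medicine")])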

The future of search engines

A search based on Boolean operators ("and", "or", "not") is a literal search: the search engine takes the search words exactly as they are entered. This can cause a problem when, for example, the entered word has several meanings. "Key", for example, can mean a means to open a door, or it can mean a password for entering a server. If you are interested in only one meaning of the word, you obviously do not need results for its other meaning. You can, of course, build a literal query that excludes results for the unwanted meaning, but it would be nice if the search engine could help you with this itself.
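
To make the Boolean ("literal") search concrete, here is a tiny sketch over an invented inverted index; the page numbers and vocabulary are made up for the example.

    # Literal Boolean search over a toy inverted index (the data is invented).
    index = {
        "key":      {1, 2, 3},   # pages mentioning "key"
        "door":     {1},
        "password": {3},
        "server":   {3, 4},
    }

    # "key AND server NOT door" expressed as explicit set operations
    result = (index["key"] & index["server"]) - index["door"]
    print(result)   # {3} - the only page about keys in the password/server sense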

One of the research directions for future search engine algorithms is conceptual information retrieval: algorithms in which statistical analysis of the pages containing a given keyword or phrase is used to find relevant data. Clearly, such a "conceptual search engine" requires much more space to store data about each page and more time to process each request. Many researchers are currently working on this problem.

No less intensive work is being done on developing search algorithms based on natural-language queries.

The idea behind natural queries is that you can write the query as if you were asking a colleague sitting opposite you. There is no need to worry about Boolean operators or strain to compose a complex query. The most popular natural-language search engine today is AskJeeves.com. It converts the query into keywords, which it then uses when searching the indexed sites. This approach only works with simple queries. However, progress does not stand still, and it is possible that very soon we will "talk" to search engines in our own "human" language.
