Search robots of Yandex, Google and other search engines in simple words: what they are and why they are needed

A search engine is usually a website that specializes in finding information matching the criteria of a user's query. The main task of such sites is to order and structure the information available on the network.

Most people who use search engine services never ask themselves how the machine actually works when it pulls the necessary information out of the depths of the Internet.

For an ordinary network user, understanding how search engines work is not critical: the algorithms the system follows are able to satisfy even people who do not know how to compose an optimized query. But for web developers and specialists engaged in site optimization, at least a basic understanding of the structure and operating principles of search engines is essential.

Each search engine runs on precise algorithms that are kept strictly secret and known only to a small circle of employees. Still, when designing or optimizing a site, it is necessary to take into account the general rules by which search engines operate, and those rules are covered in this article.

Although each search engine has its own structure, on closer examination they can all be reduced to the same main components:

Indexing module

The indexing module includes three components (robot programs):

1. Spider - downloads pages and filters the text stream, extracting all hyperlinks from it. In addition, the spider saves the download date, the server response headers and the URL of the page.

2. Crawler (the "traveling" spider robot) - analyzes all links on a page and, based on this analysis, decides which pages to visit and which are not worth visiting. In the same way the crawler discovers new resources that the search engine must process.

3. Indexer - analyzes the pages downloaded by the spider. The page is split into blocks and analyzed by the indexer using morphological and lexical algorithms. The indexer examines the different parts of a web page: titles, text and other service information.

All documents processed by this module are stored in the search engine database, called the system index. In addition to the documents themselves, the database contains the necessary service data: the result of careful processing of these documents, which the search engine relies on when executing user queries.
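As a rough, purely illustrative sketch (not the code of any real search engine), the division of labor between spider, crawler and indexer can be shown in a few lines of Python; the tiny in-memory "web" replaces real downloading:

```python
import re
from urllib.parse import urljoin

# A tiny in-memory "web" used instead of real HTTP requests (an assumption for this sketch).
FAKE_WEB = {
    "http://example.com/": '<html><title>Dogs</title><a href="/breeds">dog breeds</a></html>',
    "http://example.com/breeds": "<html><title>Breeds</title>every dog breed listed</html>",
}

def spider(url):
    """Spider: download the page and keep its text together with the URL."""
    return {"url": url, "html": FAKE_WEB.get(url, "")}

def crawler(page):
    """Crawler: extract links and turn them into absolute addresses to visit next."""
    links = re.findall(r'href="([^"]+)"', page["html"])
    return [urljoin(page["url"], link) for link in links]

def indexer(page, index):
    """Indexer: split the text into words and record which pages each word appears on."""
    for word in re.findall(r"[a-z]+", page["html"].lower()):
        index.setdefault(word, set()).add(page["url"])

index, queue, seen = {}, ["http://example.com/"], set()
while queue:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    page = spider(url)
    queue.extend(crawler(page))
    indexer(page, index)

print(index["dog"])  # the pages on which the word "dog" occurs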

Search server.

The next, very important component of the system is the search server, whose task is to process the user's query and generate the search results page.

While processing a user's query, the search server calculates a relevance score for the selected documents with respect to that query. This score determines the position a web page takes in the search results. Each document that satisfies the search conditions is displayed on the results page as a snippet.

A snippet is a brief description of a page that includes the title, a link, keywords and a short piece of text. From the snippet, the user can estimate how relevant the pages selected by the search engine are to the query.
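As a hedged sketch of the idea (real ranking formulas are kept secret and are far more complex), the two jobs of the search server - scoring documents against the query and cutting a snippet around the first match - might look like this:

```python
def score(document, query_words):
    """Naive relevance score: how many times the query words occur in the document."""
    text = document.lower()
    return sum(text.count(word) for word in query_words)

def snippet(document, query_words, width=60):
    """Cut a short fragment of text around the first query word found."""
    text = document.lower()
    for word in query_words:
        pos = text.find(word)
        if pos != -1:
            return document[max(0, pos - width // 2): pos + width // 2] + "..."
    return document[:width] + "..."

docs = {
    "page1": "Dog breeds and dog training advice for beginners.",
    "page2": "Weather forecast for the week ahead.",
}
query = ["dog", "training"]

# Rank documents by the naive score and print a snippet for each result.
for url in sorted(docs, key=lambda u: score(docs[u], query), reverse=True):
    print(url, score(docs[url], query), snippet(docs[url], query))
```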

The most important criterion that the search server uses when ranking query results is the already familiar TIC (thematic citation index).

All of the search engine components described above are expensive and very resource-intensive. The performance of a search engine depends directly on how effectively these components interact.


    Hi friends! Today you will learn how the search robots of Yandex and Google work and what role they play in site promotion. So let's go!

    Search engines do all this in order to find, out of a million sites, the ten web projects that give a high-quality and relevant answer to the user's query. Why only ten? Because the search results page consists of only ten positions.

    Search robots are friends of both webmasters and users

    Why it is important that search robots visit the site should already be clear, but why does this matter to the user? It matters because only those sites that answer the user's query in full will be shown to them.

    A search robot is a very flexible tool: it is able to find even a site that has only just been created and whose owner has not yet done anything with it. That is why this bot was called a spider - it can stretch out its legs and crawl along the virtual web to reach any place.

    Can the search robot be managed in your own interests?

    There are cases when certain pages do not appear in the search. This is mainly because the page has not yet been indexed by the search robot. Of course, sooner or later the robot will notice the page, but that takes time, and sometimes quite a lot of it. However, you can help the search robot visit the page sooner.

    To do this, you can post links to your site in special directories or lists and on social networks - in general, on all the venues where the search robot practically lives. Social networks, for example, are updated every second. Try announcing your site there, and the search robot will reach it much faster.

    From this follows one simple but main rule: if you want search engine bots to visit your site, you need to give them new content on a regular basis. If they notice that the content is being updated and the site is developing, they will visit your Internet project much more often.

    Each search robot remembers how often your content changes. It evaluates not only quality but also time intervals: if the material on the site is updated once a month, the robot will come to the site once a month.

    Likewise, if the site is updated once a week, the search robot will come once a week. If you update the site every day, the robot will visit it every day or every other day. There are sites that are indexed within a few minutes of an update: social networks, news aggregators, and sites that publish several articles a day.

    How do you give the robot a task and forbid it something?

    At the very beginning we learned that search engines have several robots that perform different tasks: one looks for pictures, another for links, and so on.

    You can manage any robot using a special file, robots.txt. It is with this file that the robot begins its acquaintance with the site. In it you can specify whether the robot is allowed to index the site and, if so, which sections. These instructions can be written either for a single robot or for all robots at once. A minimal example is shown below.
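    For illustration, a minimal robots.txt might look like the fragment below; the User-agent and Disallow directives are standard, while the closed paths are invented for the example:

```
# Rules for all robots
User-agent: *
# the /admin/ and /tmp/ paths are hypothetical examples of sections closed to indexing
Disallow: /admin/
Disallow: /tmp/

# Additional rules for one specific robot
User-agent: Googlebot
Disallow: /drafts/
```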

    Site promotion training

    I explain the finer points of SEO promotion in the Google and Yandex search engines in more detail over Skype. I have raised the attendance of all my web projects and get excellent results from that. I can teach it to anyone who is interested!

    Thematic link collections are lists compiled by a group of professionals or even by a single enthusiast. Very often a highly specialized topic can be covered by one specialist better than by the staff of a large directory. There are so many thematic collections on the network that it makes no sense to list specific addresses.

    Selecting a domain name

    A directory is a convenient search tool, but to reach the Microsoft or IBM server it hardly makes sense to go through a directory. Guessing the name of the corresponding site is not difficult: www.microsoft.com, www.ibm.com, or www.microsoft.ru and www.ibm.ru - the sites of these companies' Russian representative offices.

    Similarly, if a user needs a site about the weather around the world, it is logical to look for it at www.weather.com. In most cases, searching for a site with the keyword in its name is more effective than searching for a document whose text merely contains that word. If a Western commercial company (or project) has a one-word name and runs a server on the network, its address is very likely to take the form www.name.com, and for Runet (the Russian part of the network) - www.name.ru, where name is the name of the company or project. Guessing the address can compete successfully with other search techniques, because this kind of search lets you reach a server that is not registered in any search engine. However, if you cannot guess the right name, you will have to turn to a search engine.

    Search engines

    Tell me what you are looking for on the Internet, and I will tell you who you are

    If the computer were a highly intelligent system that could easily be told what you are looking for, it would return two or three documents - exactly the ones you need. Unfortunately, this is not the case, and in response to a query the user usually receives a long list of documents, many of which have nothing to do with what was asked. Such documents are called irrelevant (from the English relevant - pertinent, to the point). Thus, a relevant document is a document that contains the information sought. Obviously, the percentage of relevant documents obtained depends on the ability to phrase the query competently. The share of relevant documents in the list of all documents found by the search engine is called the precision of the search. Irrelevant documents are called noise. If all the documents found are relevant (there is no noise), the search precision is 100%. If all relevant documents are found, the search recall is 100%.

    Thus, search quality is determined by two interdependent parameters: the precision and the recall of the search. Increasing recall reduces precision, and vice versa.
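    As a small illustration of these two measures (a sketch, not part of any real engine), precision and recall can be computed like this:

```python
def precision_and_recall(found, relevant):
    """found: documents returned by the engine; relevant: all documents that match the need."""
    found, relevant = set(found), set(relevant)
    hits = found.intersection(relevant)        # relevant documents that were actually found
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 3 of them relevant; 6 relevant documents exist in total.
print(precision_and_recall(["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4", "d5", "d6"]))
# -> (0.75, 0.5): precision 75% (one noisy result), recall 50% (half of the relevant documents found)
```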

    How a search engine works

    A search engine can be compared to a reference service whose agents go around companies, collecting information into a database (Fig. 4.21). When the service is contacted, information is issued from this database. The data in the database become obsolete, so the agents update them periodically. Some enterprises send data about themselves on their own, and agents do not have to visit them. In other words, the reference service has two functions: creating and constantly updating the data in the database, and searching for information in the database at the client's request.


    Fig. 4.21. A search engine works like a reference service collecting information into a database

    Similarly, a search engine consists of two parts: the so-called robot (or spider), which goes around the servers of the network and builds the search engine's database, and the search mechanism itself, which answers user queries from that database.

    The robot's database is formed mostly by the robot itself (it finds links to new resources on its own) and, to a much lesser extent, by resource owners who register their sites in the search engine. In addition to the robot (network agent, spider, worm), which builds the database, there is a program that computes the rating of the found links.

    The operating principle of a search engine comes down to this: it queries its internal directory (database) for the keywords that the user types into the query field, and returns a list of links ranked by relevance.

    It should be noted that, when processing a specific user query, a search engine works with its internal resources (it does not set off on a trip across the network, as inexperienced users sometimes believe), and internal resources are naturally limited. Although the search engine database is constantly updated, a search engine cannot index all web documents: there are simply too many of them. Therefore, there is always a chance that the desired resource is simply unknown to a particular search engine.

    This idea is clearly illustrated in Fig. 4.22. Ellipse 1 bounds the set of all web documents that exist at a given moment, ellipse 2 bounds all documents indexed by the given search engine, and ellipse 3 the desired documents. Thus, with this search engine you can find only that part of the desired documents which it has indexed.


    Fig. 4.22. Only the desired documents that fall within the set indexed by the search engine can be found

    The problem of insufficient recall lies not only in the limited internal resources of the search engine, but also in the fact that the speed of the robot is limited while the number of new web documents grows constantly. Increasing the search engine's internal resources cannot solve the problem completely, because the rate at which the robot traverses resources is finite.

    At the same time, it would be incorrect to assume that a search engine contains a copy of the source resources of the Internet. Full information (the source documents) is not always stored; more often only part of it is kept - the so-called indexed list, or index, which is much more compact than the text of the documents and allows the engine to respond to search queries faster.

    To build the index, the source data are transformed so that the database is as small as possible while the search is carried out very quickly and yields the maximum of useful information. To explain what an indexed list is, one can draw a parallel with its paper analogue, the so-called concordance: a dictionary that lists, in alphabetical order, the words used by a particular writer, together with references to them and the frequency of their use in his works.

    Obviously, a concordance (dictionary) is much more compact than the original texts of the works, and finding the right word in it is much easier than leafing through a book in the hope of stumbling upon that word.
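    A hedged sketch of such a "concordance" for web pages is an inverted index that records, for every word, the pages it occurs on and how often - for example:

```python
import re
from collections import defaultdict

def build_index(pages):
    """pages: {url: text}. Returns {word: {url: number of occurrences}}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index

pages = {
    "http://example.com/dogs": "dog breeds, dog food, dog training",
    "http://example.com/weather": "weather and forecasts",
}
index = build_index(pages)
print(index["dog"])  # {'http://example.com/dogs': 3} - like a concordance entry with a frequency
```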

    Building the index

    The scheme for building the index is shown in Fig. 4.23. Network agents, or spiders, "crawl" over the network, analyze the contents of web pages and collect information about what was found and on which page.


    Fig. 4.23. The scheme for building the index

    When it finds another HTML page, most search engines record the words, pictures, links and other elements it contains (different engines do this differently). Moreover, when tracking the words on a page, not only their presence is recorded but also their location, i.e. where exactly these words appear: in the title, in subtitles, in meta tags (service tags that allow developers to place service information on a web page, including information meant to orient the search engine) or elsewhere. Usually only significant words are recorded, while conjunctions and prepositions such as "a", "but" and "or" are ignored.

    Meta tags allow page owners to specify the keywords and topics by which the page should be indexed. This can matter when the keywords have several meanings: meta tags can point the search engine to the one correct meaning out of several. However, meta tags work reliably only when they are filled in by honest site owners. Unscrupulous owners put into their meta tags the most popular words on the network, which have nothing to do with the subject of the site. As a result, visitors land on unsolicited sites, thereby raising their rating. That is why many modern search engines either ignore meta tags or treat them as secondary to the text of the page itself. Each robot also maintains its own list of resources punished for unscrupulous advertising.
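    For reference, this is roughly how such meta tags look in the head of an HTML page; the keywords, description and robots meta tags are standard, while the values are invented for the example:

```html
<head>
  <title>Dog breeds</title>
  <!-- keywords and a short description for search engines; the values are illustrative -->
  <meta name="keywords" content="dog, dog breeds, dog training">
  <meta name="description" content="A guide to choosing and training a dog.">
  <!-- an instruction for robots: index the page but do not follow its links -->
  <meta name="robots" content="index, nofollow">
</head>
```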

    Obviously, if you search for sites by the keyword "dog", the search engine must find not simply all the pages where the word "dog" is mentioned, but those where this word is related to the subject of the site. To determine to what extent a given word is related to the profile of some web page, it is necessary to evaluate how often it occurs on the page and whether or not there are links to other pages on this word. In short, the pages found for the word must be ranked by importance. Words are assigned weight coefficients depending on how many times and where they occur (in the page title, at the beginning or at the end of the page, in a link, in a meta tag, etc.). Each search engine has its own weighting algorithm, which is one of the reasons why search engines return different resource lists for the same keyword.

    Since pages are constantly updated, the indexing process must run constantly. Spider robots travel along links and form a file containing the index, which can be quite large. To reduce its size, the amount of information is minimized and the file is compressed. With several robots, a search engine can process hundreds of pages per second. Today, powerful search engines store hundreds of millions of pages and receive tens of millions of queries every day.
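    As a sketch of the weighting idea (the coefficients below are invented for illustration - every real engine keeps its own values secret), scoring a word by where it occurs on the page might look like this:

```python
import re

# Hypothetical weights: a word in the title counts much more than a word in the body text.
WEIGHTS = {"title": 5, "link": 3, "body": 1}

def page_score(page, word):
    """Sum the weighted number of occurrences of the word in each zone of the page."""
    word = word.lower()
    return sum(
        WEIGHTS[zone] * len(re.findall(word, text.lower()))
        for zone, text in page.items()
    )

page = {
    "title": "Dog breeds",
    "link": "dog photos",
    "body": "Everything about keeping a dog at home. A dog needs daily walks.",
}
print(page_score(page, "dog"))  # 5*1 + 3*1 + 1*2 = 10
```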

    When building the index, the problem of reducing the number of duplicates is also solved - a non-trivial task, given that, for a correct comparison, the document encoding must first be determined. An even harder task is separating very similar documents (called "near-duplicates"), for example those in which only the title differs while the text is duplicated. There are a lot of such documents on the network - for example, someone copies an essay and publishes it on a site under his own name. Modern search engines can solve such problems.

    Friends, I welcome you again! Now we will look at what search robots are and talk in detail about the Google search robots and how to be friends with them.

    First you need to understand what search robots actually are; they are also called spiders. What work do search engine spiders do?

    They are programs that check sites. They look through all the posts and pages on your blog and collect information, which is then passed to the database of the search engine they work for.

    You do not need to know the whole list of search robots; the most important thing is to know that Google has two major spiders, called Panda and Penguin. They fight poor-quality content and junk links, and you need to know how to repel their attacks.

    Google's "Panda" robot is designed to promote only high-quality material in the search. All sites with low-quality content are pushed down in the search results.

    This spider first appeared in 2011. Before it appeared, you could promote any site by publishing a large amount of text in articles and using a huge number of keywords. Together, these two techniques pushed low-quality content to the top of the search results, while good sites sank in the rankings.

    "Panda" immediately put things in order by checking all sites and putting everyone in their deserved place. Although it fights low-quality content, it is now possible to promote even small sites with high-quality articles - whereas previously it was pointless to promote such sites, since they could not compete with giants holding large amounts of content.

    Now let's figure out how to avoid "Panda" sanctions. First you need to understand what it does not like. I already wrote above that it fights poor content, but what kind of text counts as bad for it? Let's work through this so that you do not publish such texts on your site.

    Google's search robot strives to ensure that only high-quality materials are shown to searchers in this search engine. If you have articles that contain little information and are not visually attractive, urgently rewrite these texts so that "Panda" does not get to you.

    Quality content can be either long or short, but if the spider sees a long article with a lot of information, it assumes the article will be more useful to the reader.

    Next you need to watch out for duplication, in other words plagiarism. If you think you can rewrite other people's articles onto your blog, you can immediately put a cross on your site. Copying is strictly punished by applying a filter, and plagiarism is very easy to check - I wrote an article on how to check texts for uniqueness.

    The next thing to watch out for is over-saturation of the text with keywords. Anyone who thinks he can write an article out of nothing but keys and take first place in the rankings is badly mistaken. I have an article on how to check pages for relevance; be sure to read it.

    And what else can attract "Panda" to you? Old articles that have become outdated and bring no traffic to the site. They need to be updated.

    There is also the Google search robot "Penguin". This spider fights spam and junk links on your site. It also detects links bought from other resources. Therefore, so as not to fear this search robot, you should not buy links but publish high-quality content so that people link to you of their own accord.

    Now let's formulate what you need to do so that the site looks perfect through the eyes of a search robot:

    • To make high-quality content, first study the topic well before writing the article. Then you need to understand what people are really interested in on this topic.
    • Use specific examples and pictures - this makes an article lively and interesting. Break the text into small paragraphs so that it is easy to read. For example, if you open the jokes page in a newspaper, which ones do you read first? Naturally, each person first reads the short texts, then the longer ones, and the really long ones last of all.
    • "Panda's" favorite complaint is an article that is no longer relevant because it contains outdated information. Watch for updates and change the texts.
    • Watch the keyword density; how to determine this density I wrote above - the service I mentioned will give you the exact number of keys (a small density sketch is also shown after this list).
    • Do not engage in plagiarism. Everyone knows that stealing someone else's things or texts is the same kind of theft, and theft is answered with a filter.
    • Write texts of at least two thousand words; then such an article will look informative through the eyes of the search engine robots.
    • Do not stray from the topic of your blog. If your blog is about the Internet, you do not need to publish articles about air guns; this can lower your resource's rating.
    • Design your articles attractively, divide them into paragraphs and add pictures, so that they are pleasant to read and the visitor does not want to leave the site quickly.
    • When buying links, point them at the most interesting and useful articles, which people will actually read.
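    As promised in the list above, keyword density (the share of a key word among all words of the text) can be estimated with a few lines of code; no "recommended" threshold is hard-coded, since those vary from author to author:

```python
import re

def keyword_density(text, keyword):
    """Return the percentage of words in the text that are the given keyword."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    hits = words.count(keyword.lower())
    return 100.0 * hits / len(words)

text = "Search robots index sites. A search robot visits a site more often if the site is updated."
print(round(keyword_density(text, "site"), 1))  # share of the exact word "site" in the text
```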

    Well, now you know what work the search engine robots do and how to be friends with them. And most importantly, you have studied the Google search robots Panda and Penguin in detail.

    • Definitions and terminology
    • Robot names
    • A bit of history
    • What search engine robots do
    • The behavior of robots on the site
    • Robot Management
    • Conclusions

    What are search engine robots? What functions do they perform? What are the features of how search robots work? Here we will try to answer these and some other questions related to the work of robots.

    Definitions and terminology

    In English there are several names for search robots: robots, web bots, crawlers, spiders; in Russian, essentially one term has taken hold - robots, or bots for short.

    The website www.robotstxt.org gives the following definition of a robot:

    "A web robot is a program that traverses the hypertext structure of the WWW, recursively requesting and retrieving documents."

    The key word in this definition is recursively: it is understood that, after receiving a document, the robot will request the documents linked from it, and so on.

    Robot names

    Most search robots have their own unique name (except those robots that, for one reason or another, masquerade as user browsers).

    The robot's name can be seen in the User-Agent field of the server log files, in the reports of server statistics systems, and on the help pages of the search engines.

    Thus, the Yandex robot is collectively called Yandex, the Rambler robot is StackRambler, the Yahoo! robot is Slurp, and so on. Even user programs that collect content for later offline viewing can identify themselves specially via information in the User-Agent field.

    In addition to the robot's name, the User-Agent field may contain more information: the robot version, its purpose, and the address of a page with additional information.
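    As a hedged sketch (the log format and the list of substrings are simplified assumptions), a robot visit can be recognized in a server log by its User-Agent field like this:

```python
# Substrings that identify some well-known robots in the User-Agent field.
KNOWN_BOTS = ["Yandex", "Googlebot", "StackRambler", "Slurp"]

def bot_name(user_agent):
    """Return the name of the robot found in the User-Agent string, or None for a browser."""
    for name in KNOWN_BOTS:
        if name.lower() in user_agent.lower():
            return name
    return None

# A simplified log line: only the URL and the User-Agent field are shown here.
log_line = '/index.html "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"'
url, user_agent = log_line.split(" ", 1)
print(bot_name(user_agent))  # -> Yandex
```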

    A bit of history

    Back in the first half of the 1990s, during the development of the Internet, there was a problem with web robots: some of the first robots could put a significant load on a web server, even bringing it down, because they made a large number of requests to a site in too short a time. System administrators and webmasters had no way to control a robot's behavior within their sites and could only close access to the robot entirely - not just to the site, but to the whole server.

    In 1994 the robots.txt protocol was developed, which defines exclusions for robots and allows site owners to control search robots within their sites. You can read about these possibilities in Chapter 6, "How to make a site available to search engines."

    Later, as the network grew, the number of search robots increased and their functionality kept expanding. Some search robots did not survive to this day and remain only in archives of server log files from the late 1990s. Who now remembers the T-Rex robot that collected information for the Lycos system? It is extinct, like the dinosaur it was named after. Or where can you find Scooter, the AltaVista robot? Nowhere! Yet in 2002 it was still actively indexing documents.

    Even in the name of the main Yandex robot you can find an echo of days gone by: the fragment of its full name "compatible; Win16;" was added for compatibility with some old web servers.

    What search engine robots do

    What functions can robots perform?

    A search engine has several different robots, each with its own purpose. Let us list some of the tasks performed by robots:

    • processing requests and retrieving documents;
    • checking links;
    • monitoring updates;
    • checking the availability of a site or server;
    • analyzing page content for the subsequent placement of contextual advertising;
    • collecting content in alternative formats (graphics, data in RSS and Atom formats).

    As an example, here is a list of Yandex robots. Yandex uses several types of robots with different functions; they can be identified by the User-Agent string.

    1. Yandex/1.01.001 (compatible; Win16; I) - the main indexing robot.
    2. Yandex/1.01.001 (compatible; Win16; P) - the image indexer.
    3. Yandex/1.01.001 (compatible; Win16; H) - a robot that identifies site mirrors.
    4. Yandex/1.03.003 (compatible; Win16; D) - a robot that visits a page when it is added via the "Add URL" form.
    5. Yandex/1.03.000 (compatible; Win16; M) - a robot that visits a page when it is opened via the "Found words" link.
    6. YandexBlog/0.99.101 (compatible; DOS3.30; Mozilla/5.0; B; robot) - a robot that indexes XML files for blog search.
    7. YandexSomething/1.0 - a robot that indexes the news feeds of Yandex.News partners and robots.txt files for the blog search robot.

    In addition, Yandex runs several "knocker" robots, which only check the availability of documents but do not index them.

    1. Yandex/2.01.000 (compatible; Win16; Dyatel; C) - the "knocker" of Yandex.Catalog. If a site is unavailable for some time, it is removed from publication. As soon as the site starts responding again, it automatically reappears in the catalog.
    2. Yandex/2.01.000 (compatible; Win16; Dyatel; Z) - the "knocker" of Yandex.Bookmarks. Links to unavailable sites are highlighted in color.
    3. Yandex/2.01.000 (compatible; Win16; Dyatel; D) - the "knocker" of Yandex.Direct. It checks that the links in ads are correct before moderation.

    Nevertheless, the most common robots are those that request, receive and archive documents for subsequent processing by other mechanisms of the search engine. It is appropriate here to distinguish the robot from the indexer.

    The search robot goes around sites and fetches documents in accordance with its internal address list. In some cases, the robot can perform a basic analysis of documents in order to replenish that list. The further processing of documents and the construction of the search engine index are handled by the search engine indexer. In this scheme the robot is just a "courier" for collecting data.

    The behavior of robots on the site

    How does the behavior of a robot on a site differ from the behavior of an ordinary user?

    1. Controllability. First of all, a "well-mannered" robot requests the robots.txt file from the server, with its indexing instructions (see the sketch after this list).
    2. Selective downloading. When requesting a document, the robot specifies exactly which types of data it wants, unlike an ordinary browser, which is ready to accept everything. The main robots of popular search engines will first of all request hypertext and plain text documents, leaving aside CSS style files, images, video, ZIP archives, and so on. Information in PDF, Rich Text, MS Word, MS Excel and some other formats is also in demand nowadays.
    3. Unpredictability. It is impossible to track or predict the robot's path through the site, because it leaves no information in the Referer field - the address of the page it came from. The robot simply requests a list of documents, seemingly in random order, but in fact in accordance with its internal list or indexing queue.
    4. Speed. Short intervals between requests for different documents - seconds or fractions of a second between two document requests are typical. Some robots even have special instructions, specified in the robots.txt file, that limit the rate of document requests so as not to overload the site.
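    A hedged sketch of such "controllable" behavior using only the Python standard library: the bot name and site address are invented, and a real robot is far more elaborate, but the sequence - read robots.txt first, respect Disallow rules and the crawl delay, identify yourself in the User-Agent field - is the one described above:

```python
import time
import urllib.error
import urllib.request
from urllib import robotparser

BOT_NAME = "ExampleBot/1.0"            # hypothetical robot name
SITE = "http://example.com"            # hypothetical site

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()                              # 1. controllability: robots.txt is requested first

delay = rp.crawl_delay(BOT_NAME) or 1  # 4. speed: pause between requests (1 s if not specified)

for path in ["/", "/about.html", "/private/page.html"]:
    url = SITE + path
    if not rp.can_fetch(BOT_NAME, url):
        continue                       # the page is closed to this robot in robots.txt
    request = urllib.request.Request(url, headers={"User-Agent": BOT_NAME})
    try:
        with urllib.request.urlopen(request) as response:
            if "text/html" in response.headers.get("Content-Type", ""):
                html = response.read() # 2. selective downloading: only HTML is kept here
    except urllib.error.URLError:
        pass                           # unreachable or missing page: skip it
    time.sleep(delay)
```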

    We cannot know exactly how an HTML page looks through the eyes of a robot, but we can try to imagine it by turning off the display of graphics and styles in the browser.

    Thus, we can conclude that search robots load the HTML page into their index, but without design elements and without pictures.

    Robot Management

    How can a webmaster control the behavior of search robots on his site?

    As mentioned above, the special exclusion protocol for robots was developed in 1994 as a result of public discussion among webmasters. To this day this protocol has not become a standard that all robots without exception are obliged to observe; it remains in the status of strong recommendations. There is no authority where you can complain about a robot that does not comply with the exclusion rules; you can only block its access to the site yourself, using the web server settings or network interfaces to ban the IP addresses from which the "ill-mannered" robot sent its requests.

    However, the robots of the major search engines do comply with the exclusion rules and, moreover, contribute their own extensions to them.

    The instructions of the special robots.txt file and the special Robots meta tag are described in detail in Chapter 6, "How to make a site available to search engines."

    With the help of additional instructions in robots.txt that are not part of the standard, some search engines allow the behavior of their robots to be controlled more flexibly. Thus, using the Crawl-delay instruction, a webmaster can set the time interval between successive requests for two documents by the Yahoo! and MSN robots, and using the Host instruction, specify the address of the main mirror of the site for Yandex. However, you should be very careful when working with non-standard instructions in robots.txt, because the robot of another search engine may ignore not only an instruction it does not understand, but the entire set of rules associated with it. A hedged example is given below.
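    For illustration, a robots.txt fragment with these non-standard instructions might look like this; the delay value and mirror address are invented, and, as noted above, only robots that understand these extensions will honor them:

```
User-agent: Slurp
# ask the Yahoo! robot to wait 5 seconds between requests (value chosen arbitrarily)
Crawl-delay: 5

User-agent: Yandex
# tell Yandex which mirror of the site is the main one (hypothetical address)
Host: www.example.com
```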

    You can also influence the visits of search robots indirectly: for example, the Google robot will re-fetch more often those documents to which many other sites link.
