Working correctly with duplicate pages. Fighting duplicates. Can duplicate pages cause harm?

Duplicate pages are one of the many reasons for losing positions in search results and even falling under a filter. To prevent this, you need to keep them out of the search engine index.

You can detect duplicates on a site and get rid of them in various ways, but the seriousness of the problem is that duplicates are not always useless pages; they simply should not be in the index.

We will solve this problem now, but first let's find out what duplicates are and how they arise.

What duplicate pages are

Duplicate pages are copies of the content of the canonical (main) page, but with a different URL. It is important to note that they can be either full or partial.

Full duplication is an exact copy with its own address; the difference may show up in a trailing slash, the www prefix, or substituted parameters such as index.php?, page=1, /page/1, and so on.
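For example, all of the following addresses might serve the same content as http://site.ru/page/ and therefore count as full duplicates (the paths are illustrative placeholders, not taken from a real site):

http://site.ru/page
http://www.site.ru/page/
http://site.ru/index.php?page=1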

Partial duplication means incomplete copying of content and is tied to the site structure: article announcements in catalog listings, archives, sidebar content, pagination pages, and other cross-cutting elements that also appear on the canonical page get indexed. This is typical of most CMSs and of online stores where the catalog is an integral part of the structure.

We have already discussed the consequences of duplicates appearing: link weight is spread between the duplicates, pages get swapped in the index, content loses its uniqueness, and so on.

How to find duplicate pages on a site

The following methods can be used to search for duplicates:

  • the Google search box. Using the query site:myblog.ru, where myblog.ru is your URL, you get the pages in the main index. To see the duplicates, go to the last page of the results and click the "Show hidden results" link (a short example follows this list);
  • the "Advanced search" feature in Yandex. Specify your site's address in the dedicated field and enter, in quotes, one of the sentences of the indexed article being checked; you should get exactly one result. If there are more, duplicates exist;
  • the webmaster tools panels of the search engines;
  • manually, by adding or removing in the address bar a trailing slash, www, html, asp, php, or upper- and lower-case letters. In every case the page should redirect to the main address;
  • special programs and services: Xenu, MegaIndex, etc.
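For instance, the two operator checks above might look like this (myblog.ru is the placeholder domain used in this article):

site:myblog.ru
"one full sentence copied from the article being checked" site:myblog.ru

The first query lists the indexed pages of the domain; the second should return exactly one result if the page has no duplicates.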

Removing duplicate pages

There are also several ways to remove duplicates. Each has its own effects and consequences, so it makes no sense to single out the most effective one. Remember that physically destroying an indexed duplicate is not a way out: the search engines will still remember it. Therefore, the best method of dealing with duplicates is to prevent their appearance by configuring the site correctly.

Here are some of the ways to eliminate duplicates:

  • configuring robots.txt. This lets you close specific pages from indexing (a minimal sketch follows this list). But while Yandex robots respect this file, Google may pick up even pages closed by it, treating the file only as a recommendation. Besides, removing already indexed duplicates with robots.txt is very difficult;
  • a 301 redirect. It glues a duplicate to the canonical page. The method works, but it is not always suitable: it cannot be used when the duplicates must stay accessible as independent pages yet should not be indexed;
  • returning a 404 error for the offending duplicates. The method is very good at removing them, but it takes some time before the effect shows.
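A minimal robots.txt sketch of the first approach; the patterns below are assumed examples rather than paths from a real site:

User-agent: *
# keep URLs with query-string parameters out of the index
Disallow: /*?
# keep print versions of pages out of the index
Disallow: /print/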

When there is nothing to glue or delete, but you do not want to lose page weight or get a penalty from the search engines, the rel="canonical" attribute is used.

The rel="canonical" attribute in the fight against duplicates

Let me start with an example. An online store has two pages with identical product cards, but on one the goods are sorted alphabetically and on the other by price. Both are needed, and a redirect is not allowed. At the same time, for search engines this is a clear duplicate.

In this case it is rational to use the link rel="canonical" tag: it points to the canonical page, which is the one that gets indexed, while both pages remain available to users.

This is done as follows: in the head block of the duplicate pages, a link is specified: <link rel="canonical" href="http://site.ru/osnovnaya-stranitsa" />, where the href value is the address of the canonical page.

With this approach the user can freely visit any page of the site, but a robot, having read the rel="canonical" code, will index only the address specified in the link.

This attribute can also be useful for pagination pages. In this case, create a "Show all" page (a single long listing), treat it as the canonical one, and point the robot to it from the pagination pages via rel="canonical".
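A sketch of what this might look like in the head of each pagination page; the URLs are placeholders, with /catalog/all/ standing in for the "Show all" page:

<!-- placed on /catalog/page/2/, /catalog/page/3/ and so on -->
<link rel="canonical" href="http://site.ru/catalog/all/" />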

Thus, the choice of method for fighting page duplication depends on how the duplicates arise and whether they need to remain on the site.

The reason for writing this article was yet another panicked call from an accountant just before submitting VAT reports. Last quarter a lot of time was spent cleaning up duplicate counterparties. And here they are again, the same ones plus new ones. Where from?

I decided to spend the time and deal with the cause rather than the consequence. The situation is mainly relevant where automatic uploads via exchange plans are configured from the management application (in my case UT 10.3) to the company's accounting application (in my case BP 2.0).

Several years ago these configurations were installed and automatic exchange between them was set up. Then we ran into a peculiarity of how the sales department maintained the counterparty catalog: for one reason or another they began creating duplicate counterparties (with the same INN / KPP / name), filing the same counterparty in different groups. Accounting voiced its displeasure and a decision was made: it does not matter what they do over there, merge the cards into one when loading. I had to intervene in how objects were transferred by the exchange rules: for counterparties I removed the search by internal identifier and left the search by INN + KPP + name. However, this had its own pitfalls in the form of people who liked to rename counterparties (as a result, duplicates in BP were now created by the rules themselves). Everyone got together, discussed it, and agreed that the duplicates were on the UT side; they were removed, and we returned to the standard rules.

The catch is that after the duplicates were "combed out" in UT and in BP, the internal identifiers of many counterparties no longer matched. And since the standard exchange rules search for objects exclusively by internal identifier, the next batch of documents produced a new duplicate counterparty in BP whenever those identifiers differed. But universal XML data exchange would not be universal if this problem could not be worked around. Since the identifier of an existing object cannot be changed by standard means, the situation can be bypassed using the special information register "Compliance of objects for exchange", which is available in all standard 1C configurations.

To avoid producing new duplicates, the duplicate-cleaning algorithm became as follows:

1. In BP, using the "Search and replace duplicate elements" data processor (it is a standard one; it can be taken from the Trade Management configuration or from the ITS disk, or you can pick the most suitable of the many variations on Infostart), I find a duplicate, choose the correct element, and run the replacement.

2. I get the internal identifier of the single object left after the replacement (I wrote a simple data processor specifically for this, so that the internal identifier is automatically copied to the clipboard).

3. I open the "Compliance of objects for exchange" register in UT and filter it by the reference to my object.


Fighting duplicate pages

A site owner may not suspect that some pages on the site have copies, and most often that is exactly the case. The pages open, their content is fine, but if you pay attention you will notice that pages with the same content have different addresses. What does that mean? For live users, nothing, since they care about the information on the pages; but the soulless search engines perceive this quite differently: to them these are entirely different pages with the same content.

Are duplicate pages harmful? While an ordinary user may not even notice the duplicates on your site, the search engines will spot them immediately. What reaction should you expect? Since the copies are essentially seen as different pages, the content on them is no longer unique, and that already affects ranking negatively.

Duplicates also dilute the link weight that the optimizer tried to concentrate on the target page. Because of a duplicate, that weight may not end up on the page it was meant to boost at all. In other words, the effect of internal linking and external links can be reduced many times over.

In the overwhelming majority of cases, the CMS is to blame for duplicates appearing: because of wrong settings and a lack of due attention from the optimizer, exact copies are generated. Many CMSs sin with this, Joomla for example. It is hard to offer a universal recipe for solving the problem, but you can try one of the plug-ins for deleting copies.

Fuzzy duplicates, where the content is not fully identical, usually appear through the webmaster's own fault. Such pages are often found on online store sites, where product pages differ by only a few sentences of description while the rest of the content, consisting of cross-cutting blocks and other elements, is the same.

Many specialists argue that a small number of duplicates will not hurt a site, but if they exceed 40-50%, the resource may be in for serious difficulties. In any case, even if there are not that many copies, it is worth eliminating them: that way you are guaranteed to get rid of duplicate-related problems.

Searching for page copies

There are several ways to search for duplicate pages, but first you should ask several search engines how they see your site: just compare the number of pages in the index of each. This is quite simple and requires no extra tools: in Yandex or Google, type site:yoursite.ru in the search box and look at the number of results.




If after such a simple check the numbers differ greatly, by 10-20 times, this most likely indicates that one of the indexes contains duplicates. Page copies may not be to blame for such a difference, but it still gives grounds for a further, more thorough search. If the site is small, you can count the real pages manually and then compare the figure with the indicators from the search engines.

You can also look for duplicates by URL in the search engine results. If the site uses human-readable URLs, then pages with URLs made of unintelligible characters, like "index.php?s=0f6b2903d", will immediately stand out from the general list.

Another way to detect duplicates using search engines is to search by text fragments. The procedure is simple: enter a text fragment of 10-15 words from each page into the search box and analyze the result. If two or more pages appear in the results, copies exist; if there is only one result, the page has no duplicates and you need not worry.

Naturally, if the site consists of a large number of pages, such a check can turn into an impossible routine for the optimizer. To minimize the time spent, you can use special programs. One of these tools, probably familiar to experienced specialists, is Xenu's Link Sleuth.


To check a site, open a new project via the "File" menu, "Check URL", enter the address, and click "OK". After that the program will start processing all the site's URLs. When the check is finished, export the resulting data to any convenient editor and start looking for duplicates.

In addition to the methods above, the Yandex.Webmaster panel and Google Webmaster Tools offer page-indexing checks that can be used to search for duplicates.

Ways to solve the problem

Once all the duplicates have been found, they need to be eliminated. This can also be done in several ways, but each specific case needs its own method, and it is possible you will have to use all of them.

  • Copy pages can be deleted manually, but this method really only suits duplicates that were created by hand through the webmaster's carelessness.
  • A 301 redirect is great for gluing together copy pages whose URLs differ by the presence or absence of www (a sketch follows this list).
  • Solving duplicate problems with the canonical tag works for fuzzy copies, for example product category pages in an online store that duplicate each other and differ only in sort parameters. Canonical also suits print versions of pages and other similar cases. It is used quite simply: the rel="canonical" link is placed on every copy and points to the main, most relevant page, which itself does not need it. The code should look something like this: <link rel="canonical" href="http://yoursite.ru/osnovnaya-stranica" />, and it must sit inside the head tag.
  • Configuring the robots.txt file can also help in the fight against duplicates. The Disallow directive lets you close duplicates off from search robots. You can read more about the syntax of this file in our newsletter.
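A hedged .htaccess sketch of the www gluing mentioned in the list above, assuming an Apache server; site.ru stands in for your own domain:

RewriteEngine On
# send every www request to the non-www host with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^www\.site\.ru$ [NC]
RewriteRule ^(.*)$ https://site.ru/$1 [R=301,L]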

Duplicates are pages on the same domain with identical or very similar content. Most often they appear because of how the CMS works, errors in robots.txt directives, or mistakes in setting up 301 redirects.

Why duplicates are dangerous

1. The search robot identifies the relevant page incorrectly. Suppose one and the same page is available at two URLs:

https://site.ru/kepki/

https://site.ru/catalog/kepki/

You have invested money in promoting the page https://site.ru/kepki/. Thematic resources now link to it, and it has taken positions in the top 10. But at some point the robot drops it from the index and adds https://site.ru/catalog/kepki/ instead. Naturally, this page ranks worse and attracts less traffic.
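One way to hint the right choice to the robot in this example is a canonical link on the second URL; this is a sketch, assuming https://site.ru/kepki/ is the page being promoted:

<!-- placed in the head of https://site.ru/catalog/kepki/ -->
<link rel="canonical" href="https://site.ru/kepki/" />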

2. The time robots need to crawl the site increases. Robots are allocated a limited time to scan each site. If there are many duplicates, the robot may never reach the main content, so indexing is delayed. This problem is especially relevant for sites with thousands of pages.

3. Sanctions from the search engines. Duplicates by themselves are not a reason to pessimize a site, as long as the search algorithms do not conclude that you are creating duplicates intentionally in order to manipulate the results.

4. Problems for the webmaster. If the work of eliminating duplicates is put off indefinitely, they can accumulate in such numbers that the webmaster will find it physically difficult to process the reports, systematize the causes of the duplicates, and make corrections. A large volume of work increases the risk of errors.

Duplicates are conventionally divided into two groups: explicit and implicit.

Explicit duplicates (the page is available at two or more URLs)

There are many variants of such duplicates, but they are all similar in essence. Here are the most common ones.

1. URLs with and without a trailing slash

https://site.ru/list/

https://site.ru/list

What to do: configure the server response "HTTP 301 Moved Permanently" (a 301 redirect).

How to do it:

    • find the .htaccess file in the site's root folder and open it (if there is none, create a plain-text file, name it .htaccess, and place it in the site root);
    • add commands to the file to redirect from the URL with a slash to the URL without a slash:

RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} ^(.+)/$
RewriteRule ^(.+)/$ /$1 [R=301,L]

    • or, for the reverse operation (redirecting from the URL without a slash to the URL with one):

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*[^/])$ $1/ [R=301,L]

    • if the file is created from scratch, all the redirects must be written inside wrapper lines like the sketch shown below:
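A sketch of those wrapper lines, assuming the standard Apache mod_rewrite module:

<IfModule mod_rewrite.c>
RewriteEngine On
# the RewriteCond / RewriteRule blocks from above go here
</IfModule>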



Configuring a 301 redirect via .htaccess suits only sites running on Apache. For NGINX and other servers the redirect is configured differently.
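For reference, a minimal NGINX sketch of the same trailing-slash redirect; this assumes a typical setup and belongs inside the server block of the site's configuration:

# remove the trailing slash with a permanent (301) redirect
rewrite ^/(.*)/$ /$1 permanent;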

Which URL is preferable: with or without the slash? Purely technically there is no difference. Judge by the situation: if more pages are indexed with the slash, keep that variant, and vice versa.

2. URLs with and without www

https://www.site.ru/1

https://site.ru/1

What to do: Specify the main mirror of the site in the webmaster panel.

How to do this in Yandex:

    • go to Yandex.Webmaster;
    • in the panel, select the site from which the redirect will go (most often the redirect goes to the URL without www);
    • go to the "Indexing / Site move" section, clear the checkbox next to the "Add www" item, and save the changes.

Within 1.5-2 weeks Yandex will re-glue the mirrors and re-index the pages, and only URLs without www will appear in the search results.

Important! Previously, the Host directive had to be written in the robots.txt file to specify the main mirror, but it is no longer supported. Some webmasters still include this directive "just in case" and, for extra certainty, also set up a 301 redirect; this is not required, it is enough to configure the gluing in the webmaster panel.

How to glue mirrors in Google:

    • in Search Console, select the site from which the redirect will go;
    • click the gear icon in the upper right corner, choose the "Site Settings" item, and set the main domain.

As with Yandex, additional manipulation with 301 redirects is not needed, although gluing can also be implemented that way.

To check where exactly your duplicates are indexed, here is what to do:

    • export the list of indexed URLs from Yandex.Webmaster;
    • upload this list into the SeoPult URL list tool, either directly or via an XLS file (detailed instructions for using the tool are available);

    • run the analysis and download the result.

In this example, the pagination pages are indexed by Yandex but not by Google. The reason is that they are closed from indexing in robots.txt only for the Google bot. The solution is to set up canonicalization for the pagination pages.
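A hypothetical robots.txt fragment that produces this kind of one-sided blocking; the /page/ path is an assumed pagination pattern, not taken from the example site:

# pagination closed for Googlebot only
User-agent: Googlebot
Disallow: /page/

# all other robots, including Yandex, may crawl everything
User-agent: *
Disallow: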

Using the SeoPult parser, you will see whether pages are duplicated in both search engines or only in one. This will let you choose the optimal tools for solving the problem.

If you do not have the time or experience to deal with duplicates, order an audit: besides duplicates, you will get a lot of useful information about your resource, such as errors in the HTML code, headings, meta tags, structure, internal linking, usability, and content optimization. As a result, you will have ready-made recommendations in hand that will make the site more attractive to visitors and raise its positions in search.
