Robots.txt: what is it for?
When we talk about Google, we often mention the famous robots responsible for analyzing each website. Also called “spiders”, Google’s robots explore URLs, the links on a page, and the HTML source code. It is the visit of these bots that allows your site to be indexed.
The sitemap.xml file makes it easier for robots to access a site. The robots.txt file is, on the contrary, an exclusion protocol that notably prohibits robots from exploring certain pages of your site. The invention of the robots.txt file is attributed to Martijn Koster, who wanted to regulate the crawling activity of the robots responsible for analyzing the pages of a website.
But then, why prevent robots from accessing your site? How can the robots.txt file be useful for SEO, and how do you set it up? The OMT team explains everything in detail!
What is a robots.txt?
The robots.txt file is a text file that tells Google’s robots which pages to crawl on a website. More precisely, this file implements an exclusion protocol indicating to the robots which pages should be explored and which should not. Please note that the robots.txt file is intended only for search engine crawlers: the pages they are told not to analyze remain accessible to Internet users.
The robots.txt file is placed by the developer at the root of the website. It does not allow an already indexed page to be de-indexed, since its purpose is to prevent crawling: if a page has already been crawled, it will therefore remain indexed. It is the “noindex” tag that allows the developer of a website to prevent certain pages from being indexed.
Why set up a robots.txt file?
Why prevent Google’s robots from crawling certain pages of a website? Quite simply to optimize the robots’ crawl time! This file is used, for example, by websites with a large number of URLs: it prevents robots from spending time crawling pages whose indexing is not a priority. A robots.txt file is in no way used to prevent a web page from showing up in the SERPs; it simply manages robot traffic on your website and indicates your priorities to the crawlers.
When the crawler robots arrive on a site, they start by downloading the robots.txt file so that they first read the rules that apply to the site. Only after reading these rules and instructions do they begin exploring the website.
So how do you determine which pages should not be crawled? It can be very useful to prevent robots from crawling pages containing duplicate content, or the results pages of your website’s internal search engine. It can also be confidential content or internal resources such as a specification or a white paper.
The robots.txt file can prevent the crawling of three types of content:
– A Web page.
– A resource file.
– A multimedia file.
Applied to web pages, the robots.txt file helps manage robot traffic on your website: it keeps your server from being overwhelmed by crawler requests and lets you prioritize the pages that most deserve indexing. Note that pages blocked by robots.txt can still appear in the SERPs; however, they will be shown without any description.
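For illustration, here is a minimal sketch of such a file; the paths shown (/search/, /docs/white-paper.pdf, /videos/demo.mp4) are purely hypothetical and stand for an internal search results page, a resource file, and a multimedia file:
User-agent: *
Disallow: /search/
Disallow: /docs/white-paper.pdf
Disallow: /videos/demo.mp4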
How to create a robots.txt file?
The robots.txt file is located at the root of a website; concretely, it takes this form: www.example.com/robots.txt. Such a file can contain several directives: Disallow, Allow, and Sitemap. The User-agent directive is mandatory, since it specifies which robot the rules are aimed at; an asterisk (*) is used to target all crawlers.
A rule can target the URL path of a specific page or simply contain a general instruction. Each directive has its own syntax, and if it is not respected down to the last character, the file will not work as intended.
Here is the example file given by Google. It first targets a specific user-agent responsible for crawling the site, in this case “Googlebot”, then addresses all other crawlers:
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
Attention: the file name must always be written in lowercase (“robots.txt”) and the file must be located at the root of the site. A website can contain only one robots.txt file.
Robots.txt best practices and mistakes to avoid
The main risk with the robots.txt file is blocking access to pages that should have priority. We recommend writing this file in a simple plain-text editor, making sure that there are no blank lines inside the directive blocks. Also be careful to respect the order of the directives when writing them.
Avoid writing your file in a classic word processor (such as Pages or Word): special characters could slip into the file and prevent the directives from being taken into account by the robots.
We advise you to test your file before uploading it. Once tested and uploaded, Google’s robots will find the file without any intervention on your part. How you upload a robots.txt file depends on the server and architecture of each website; it is generally enough to drag the file to the location provided by your server.
Be aware that crawling by Google’s robots is allowed by default, so you do not need to include an “Allow” directive for the pages you want to prioritize. The “Allow” directive is only useful to override a “Disallow” rule in the same robots.txt file.
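To illustrate this override behavior, here is a minimal sketch; the /private/ directory and the press-kit.html page are hypothetical examples:
User-agent: *
Disallow: /private/
Allow: /private/press-kit.html
In this sketch, everything under /private/ is excluded from crawling except the single press-kit page, which the Allow rule puts back within reach of the robots.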
Why do website pages sometimes need to be de-indexed?
The Google robots responsible for crawling the web are not picky: in the absence of any instruction from you, they will index all the pages of your website, even those of no interest to Internet users or for SEO. This approach was rather beneficial at a time when it was thought that indexing as many pages as possible was the way to be visible on search engines. However, the situation has changed since the Panda update.
Since the 2010s, Google has been fighting spam. To do this, the web giant implemented a filter system called “Panda”. The filter’s objectives: penalize poor-quality sites and favor relevant content. De-indexing therefore corresponds to common sense: it is better to favor quality over quantity!
De-indexing can even be favorable to your SEO. Like the robots.txt file, de-indexing makes it possible to optimize your crawl budget: by de-indexing useless pages, you free up the robots’ crawl time and give priority to the web pages you really want to highlight.
The de-indexing process can be useful for several reasons: some pages are indexed by mistake, others are duplicate content, are of poor quality, or contain content that may pose problems for your business (health claims, sensitive data, etc.).
Which web pages should be de-indexed?
Pages with obsolete content: your site contains a news section or a blog on which you regularly publish articles, related or not to the news of your field of activity. If you follow a regular publication schedule, some articles may no longer be relevant. If you plan to update them, do not delete them from your site: simply de-index them and plan to re-index them once they are no longer obsolete.
Duplicate content: Google sometimes severely penalizes sites with duplicate content. Even when you actively fight duplicate content, a site can still end up with similar content: this is the case, for example, when an article is available both online and in a downloadable or printable PDF version.
Pages with little or irrelevant content: yes, they exist, and many website owners do not think about them. It can simply be a page that thanks the Internet user for their purchase, or a forum page with no replies.
Protected content: these are, for example, forms filled out by Internet users and containing personal data.
How to de-index your web pages on Google?
There are two main methods of de-indexing: the meta tag and the X-Robots-Tag.
The robots meta tag gives robots the indexing directive for a page. If the tag says content=”index, follow”, the page will be indexed and the links it contains will be followed. If the tag says content=”noindex, follow”, the page will not be indexed, but its links can still be followed. If the page in question is already indexed, it will be de-indexed; if it is not yet indexed, it will not be in the future.
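In practice, this tag is placed in the head section of the page. Here is a minimal sketch of a page carrying the “noindex, follow” directive (the title is just a placeholder):
<head>
  <meta name="robots" content="noindex, follow">
  <title>Page to de-index</title>
</head>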
The X-Robots-Tag is an HTTP response header that is useful for de-indexing non-HTML content such as PDF documents or Excel files. In particular, it allows you to de-index images.
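As an illustration, the directive is simply a line in the HTTP response:
X-Robots-Tag: noindex
On an Apache server, for example, a snippet like the following (assuming the mod_headers module is enabled; the .pdf pattern is just an example) would apply it to every PDF file served by the site:
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>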
The de-indexing process may take some time: your page will only be de-indexed once Google’s robots come back to crawl it.
After de-indexing your useless pages, set up regular monitoring. Thanks to Search Console, you will be able to follow the number of pages indexed by the search engine. To speed up the de-indexing process, you can also provide Google with a sitemap entirely dedicated to the pages you wish to de-index. If content needs to be removed quickly from your site, go directly to Search Console and ask Google to temporarily remove a URL from its index.
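For illustration, such a dedicated sitemap is a standard sitemap.xml file that simply lists the pages you have marked with “noindex”; the example.com URLs below are hypothetical:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/old-article/</loc>
  </url>
  <url>
    <loc>http://www.example.com/thank-you-page/</loc>
  </url>
</urlset>
Submitting this file in Search Console encourages the robots to recrawl these URLs sooner and take the “noindex” directive into account.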