This Is How You Make Sure Google Indexes Your Website The Way You Want
Jul 31st - 3min.
Every website wants to be indexed by search engines. To make sure a website gets indexed, many companies clearly indicate which pages their website contains, making it easier for Google to crawl them. This is done with robots.txt and sitemap.xml files. Thanks to these files you can steer the indexing of your own website and define all of your URLs. By marking the pages that matter most for your search engine optimization (SEO), you push Google in the right direction.
With a robots.txt file, also known as the Robots Exclusion Protocol, you configure which pages Google may or may not index. In short, you give Google instructions on how to crawl your site. You can find the robots.txt file of a website by appending /robots.txt to the domain, for example https://www.bol.com/robots.txt.
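As a sketch of how these instructions are read, Python's standard-library `urllib.robotparser` can parse a robots.txt file and answer whether a given URL may be crawled. The file contents and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, in the spirit of what a web shop might serve.
ROBOTS_TXT = """\
User-agent: *
Disallow: /basket/
Disallow: /account/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # against a live site you'd use set_url(...) + read()

print(rp.can_fetch("*", "https://www.example.com/products/tv"))  # True
print(rp.can_fetch("*", "https://www.example.com/basket/"))      # False
```

The same `can_fetch` check is what well-behaved crawlers run before requesting each page.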
Nail down how much time there should be between scraping actions
A company is not okay with a single computer suddenly making 15,000 site visits in a few minutes. That's why they indicate in their robots.txt file, via the Crawl-delay directive, how many seconds there should be between automated website visits. To give you an idea: at Bol.com this delay is 20 seconds.
Since SEO is so important to many companies (think of web shops, for example), they want to be actively indexed by Google. In doing so, they also (inadvertently) open their doors to other data miners. To prevent their site from being overwhelmed, they tell scrapers how to deal with the indexation of their website.
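A polite scraper can read that delay back and respect it. A minimal sketch, again with made-up file contents modeled on the 20-second delay mentioned above:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt declaring a 20-second delay between visits.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 20
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("*") or 1  # fall back to a modest default if none is set
print(delay)  # 20

# A polite scraper then waits between requests, e.g.:
# for url in urls_to_visit:
#     download(url)
#     time.sleep(delay)
```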
Apply specific restrictions to search engine bots
URLs that lead to customer shopping baskets or personalized pages are restricted for search engines. For each "user agent", i.e. per search engine bot, a website or company can indicate which URLs may or may not be indexed. LinkedIn, for example, only lets the bots it explicitly names index its pages; all other bots are (in most cases) not allowed to crawl at all. Sometimes other bots are given the opportunity to apply for crawling access.
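Here is a sketch of how such per-bot rules look and how they evaluate, with made-up contents in the spirit of that allowlist approach: the named bot may crawl everything except baskets, while every other bot is shut out.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: one named bot gets access, the catch-all blocks the rest.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /basket/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "/products/"))      # True: only /basket/ is off-limits
print(rp.can_fetch("Googlebot", "/basket/123"))     # False
print(rp.can_fetch("SomeRandomBot", "/products/"))  # False: catch-all disallows everything
```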
Please note that robots.txt files aren’t binding. Search engines are in theory not obliged to comply with your provisions.
To understand this better, here's a metaphor: your robots.txt file is a ski slope map. It tells you where to ski and what to expect on a piste, but there is no one stopping you from skiing off-piste. Likewise, search engines can index pages without following the marked slopes.
The robots.txt file also tells you where to find the sitemap.xml; read more about this below.
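When the file contains a `Sitemap:` line, Python 3.8+ exposes it directly via `site_maps()`. The contents below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that advertises where its sitemap lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /basket/

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```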
In a sitemap.xml you list all the URLs you have, along with their associated metadata. For each URL you can see when the page was last updated and how important it is to the website. Bestsellers, for example, get a higher priority for indexing, which is quite important if you offer 20,000,000 products like Bol.com. Search engines can use this information to crawl websites in a more targeted way.
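Each sitemap entry carries the URL plus optional `<lastmod>` and `<priority>` tags. A small made-up example, parsed with the standard library, shows how a crawler could pick out the high-priority pages first:

```python
import xml.etree.ElementTree as ET

# A tiny, invented sitemap.xml: a bestseller with high priority and a
# long-tail product with low priority.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/bestseller</loc>
    <lastmod>2024-05-01</lastmod>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://www.example.com/long-tail-product</loc>
    <lastmod>2023-11-12</lastmod>
    <priority>0.3</priority>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
entries = [
    (u.findtext("sm:loc", namespaces=NS),
     float(u.findtext("sm:priority", namespaces=NS)))
    for u in root.findall("sm:url", NS)
]
print(entries)
```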