In response to a rise in the misuse of 403 and 404 response codes, which can hurt a site's search presence, Google published guidance on the correct ways to reduce Googlebot's crawl rate.
The guidance notes that web publishers and content delivery networks are increasingly misusing these response codes for rate limiting.
Slowing Down Googlebot (Rate Limiting)
Googlebot is the software that Google uses to automatically visit (crawl) websites and download their content.
Rate limiting Googlebot means slowing down how quickly Google crawls a website.
Google’s crawl rate is the number of page requests that Googlebot makes every second.
A publisher might want to slow down Googlebot if, for example, it’s putting too much stress on the server.
Google suggests several ways to slow the rate at which Googlebot crawls your site; Google Search Console is the recommended method.
When you use rate limiting in Search Console, the crawl rate will slow down for 90 days.
Google's crawl rate can also be reduced with robots.txt, by blocking Googlebot from crawling specific pages, directories, or the entire website.
An advantage of robots.txt is that it only asks Google not to crawl pages; it does not ask Google to remove them from the index.
However, Google cautions that blocking with robots.txt can have "long-term effects" on the way it crawls a site, which may be why Search Console is the recommended approach.
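As a sketch of the robots.txt approach, the rules below block Googlebot from two hypothetical directories while leaving the rest of the site, and all other crawlers, unaffected; the paths are illustrative examples, not taken from Google's guidance:

```
# Hypothetical robots.txt sketch — the directory paths are examples only
User-agent: Googlebot
Disallow: /search/
Disallow: /internal-reports/

# All other crawlers remain unrestricted
User-agent: *
Disallow:
```

Note that, as described above, this only reduces crawling of the blocked paths; it is not a request to remove anything from the index.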
Google: Stop Rate Limiting with 403 & 404
On its Search Central blog, Google told publishers not to use 4xx response codes for rate limiting, with the exception of the 429 response code.
The blog post specifically calls out the misuse of the 403 and 404 error response codes for rate limiting, but the advice applies to all 4xx response codes except 429.
The recommendation is needed because more and more publishers are using these error response codes to slow down Google’s crawl rate.
The 403 response code indicates that the client (in this case, Googlebot) is not allowed to access the page.
The 404 response code tells Googlebot that the page could not be found.
The 429 response code ("Too Many Requests") is the one 4xx error that is valid for rate limiting.
If publishers keep serving 403 or 404 responses for rate limiting, Google may eventually drop the affected pages from its search index.
Once dropped, those pages are no longer considered for ranking in search results.
Google wrote:

"Over the last few months we noticed an uptick in website owners and some content delivery networks (CDNs) attempting to use 404 and other 4xx client errors (but not 429) to attempt to reduce Googlebot’s crawl rate.

The short version of this blog post is: please don’t do that…"
Instead, Google recommends using the 500, 503, or 429 response codes.
The 500 response code indicates an internal server error, and the 503 response indicates that the server is temporarily unable to handle the request.
Google treats both of these responses as temporary errors, so it will return later to check whether the pages are available again.
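As a minimal sketch of how these codes map to the crawling behavior described above, the function below summarizes Google's stated reactions; the function name and the return strings are illustrative, not part of any Google API:

```python
def googlebot_reaction(status: int) -> str:
    """Summarize how Google says it treats an HTTP status code."""
    if status in (500, 503, 429):
        # Treated as temporary: Google backs off and retries later.
        return "temporary: crawl slows, pages are retried later"
    if 400 <= status < 500:
        # Sustained 4xx (other than 429) can drop pages from the index.
        return "removal: pages may be dropped from the index over time"
    return "normal crawling"
```

For example, `googlebot_reaction(404)` reports possible removal, while `googlebot_reaction(503)` reports a temporary back-off.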
A 429 response tells the bot that it is making too many requests, and the accompanying Retry-After header can tell it how long to wait before crawling again.
Google says that publishers who want to rate limit Googlebot should consult its developer page.
What Rate Limiting with 4xx Does to Googlebot
All 4xx HTTP status codes, except for 429, will cause Google Search to remove your content. Worse, if you also serve your robots.txt file with a 4xx HTTP status code, it will be treated as if it didn’t exist.
If you had a rule that said crawlers couldn’t look at your dirty laundry, Googlebot now knows about it, which isn’t good for either party.