Every search engine bot first reads a website’s robots.txt file to understand its crawling rules. This makes the robots.txt file a key part of SEO for any Blogger blog. If your posts are not being indexed properly, setting up a custom robots.txt file can fix the problem and improve your blog’s visibility.
In this guide, I will explain what a robots.txt file is, why it is important for SEO, and how you can create a well-optimized custom robots.txt file for your Blogger website. I will also show you how to manage blocked pages reported in Google Search Console and help index your articles faster.
By following these steps, you can ensure that search engines crawl your content efficiently and boost the search performance of your Blogger blog.
What is a robots.txt file?
The robots.txt file tells search engine crawlers, or bots, which URLs on your website they are allowed to crawl.
This mainly helps prevent your website from being overloaded with too many crawl requests and also saves server bandwidth.
With it, you can block unnecessary pages from being crawled while keeping important pages open to search engines.
The robots.txt file belongs to the Robots Exclusion Protocol (REP), a set of web standards that govern how robots or web crawlers browse the web, access and index content, and present that content to users.
Typically, the robots.txt file is placed in the root folder of a website and can be accessed using a URL like this:
https://example.com/robots.txt
This way, you can quickly check your Blogger site’s robots.txt file by adding robots.txt after your homepage URL, as shown in the example above.
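For example, a blog on the default Blogger subdomain and one on a custom domain would expose their files at addresses like these (example.blogspot.com and www.example.com are placeholders):
https://example.blogspot.com/robots.txt
https://www.example.com/robots.txt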
Structure of a Robots.txt File
The standard format of a robots.txt file looks like this:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
A single robots.txt file can contain multiple groups of rules, each with its own user-agent and directives (such as Disallow, Allow, Crawl-delay, and Sitemap).
These are the most commonly used terms in a robots.txt file:
User-agent: This specifies the web crawler to which the directive applies (most often a search engine).
Disallow: This directive tells the user-agent not to crawl a specific URL path. Only one “Disallow:” line is allowed per URL.
Allow: This directive lets a crawler access a page or subfolder even if its parent folder is disallowed. Google and Bing both support it.
Crawl-delay: This tells the web crawler how many seconds to wait before fetching the next page, which helps reduce server load. Bing and Yandex respect it, but Googlebot ignores it.
Sitemap: This directive points crawlers to the XML sitemap(s) for the site. Google, Bing, Yahoo, and Ask support it.
Comments: Any line that begins with “#” is a comment. Comments are ignored by crawlers, but they help humans understand the rules and document them. For example, # This comment explains the rule.
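As a quick illustration, here is a small hypothetical robots.txt that combines these terms (the folder names and domain are placeholders):
# Block a private folder for all bots, but allow one page inside it
User-agent: *
Disallow: /private/
Allow: /private/welcome.html
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
Here, every crawler is asked to skip the /private/ folder except the one allowed page, to wait ten seconds between requests (for crawlers that honour Crawl-delay), and to fetch the sitemap at the listed URL.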
How to check robots.txt?
To check the contents of the robots.txt file, follow these steps:
Find the robots.txt file: The robots.txt file is usually placed in the root directory of the website you want to check. For example, if your website is www.example.com, you can find the file at www.example.com/robots.txt.
Access the file: Open a browser and type the full URL of the robots.txt file in the address bar. For example, www.example.com/robots.txt. This will display the contents of the file directly in your browser.
Review the file: Look carefully at the contents of the robots.txt file. It contains directives that guide web crawlers, such as search engine bots, which pages to crawl and which to block. The file follows a specific syntax and set of rules. Make sure the directives are written correctly and match your desired instructions for search engines.
Validate syntax: You can check the file’s syntax using online robots.txt validation tools. These tools analyze the file and highlight any errors or issues. Widely used options include the robots.txt report in Google Search Console (which replaced the older robots.txt Tester), Bing Webmaster Tools, and third-party validators.
Test with a web crawler: Once the syntax is verified, you can test functionality using a web crawler or search engine bot simulator. These tools show how search engines interpret your robots.txt rules and which pages they can index. Popular options include Screaming Frog SEO Spider, Sitebulb, and Netpeak Spider.
By following these steps, you can ensure that your robots.txt file is working properly, formatted correctly, and aligned with your instructions for search engine bots.
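If you prefer a scripted check, here is a minimal sketch using Python’s standard urllib.robotparser module. Note that this parser follows the original Robots Exclusion Protocol and does not interpret wildcard rules (such as /search*) the way Googlebot does, so treat it as a quick sanity check for simple prefix rules and use the tools above for anything more advanced. The URLs in the snippet are placeholders.

# Minimal robots.txt check with Python's standard library.
# Caveat: urllib.robotparser does not handle wildcard rules (e.g. "/search*")
# the way Googlebot does.
from urllib.robotparser import RobotFileParser

robots_url = "https://www.example.com/robots.txt"  # placeholder: use your blog's URL

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetch and parse the live file

# Ask whether specific URLs may be crawled by any bot ("*")
for url in [
    "https://www.example.com/2024/05/sample-post.html",
    "https://www.example.com/search/label/SEO",
]:
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)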
Default Robots.txt File for Blogger Blog
To improve SEO for a Blogger blog, it is important to understand the CMS structure and review the default robots.txt file. Here is the default robots.txt file used by Blogger:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: https://www.example.com/sitemap.xml
The first group applies to Mediapartners-Google, the Google AdSense crawler. The empty Disallow line means it is not restricted, so AdSense can crawl every page of the site to serve relevant ads.
The next group applies to all other bots (*), which are not allowed to crawl /search pages. This keeps search and label pages (which share the same /search URL structure) out of the crawl. The Allow: / rule lets every other page be crawled.
The last line points to the Blogger post sitemap.
This default file works well for managing how search engine bots crawl your blog. However, it still allows archive pages to be crawled, which can lead to duplicate content issues and unnecessary indexed pages on a Blogger site.
Optimizing Robots.txt for Blogger Blogs
After analyzing the default robots.txt, we can optimize it for better SEO performance.
The default setup allows archive pages to be crawled, which can result in duplicate content, while Disallow: /search* keeps all search and label pages blocked from crawling.
Adding the rule Disallow: /20* prevents crawling of the archive sections. Because Blogger post URLs also begin with the year (for example /2024/...), this rule would block posts as well, so an Allow: /*.html rule is needed to ensure that posts and pages can still be crawled.
By default, Blogger’s sitemap includes only posts, not static pages. Therefore, you should also add the pages sitemap, located at https://example.blogspot.com/sitemap-pages.xml or, for a custom domain, at https://www.example.com/sitemap-pages.xml. Submitting these sitemaps to Google Search Console helps with indexing.
Here is a custom robots.txt optimized for a Blogger blog:
User-agent: Mediapartners-Google
Disallow:
User-agent: * # select all crawling bots and search engines
Disallow: /search* # block all user-generated query pages
Disallow: /20* # prevent crawling of Blogger archive sections
Disallow: /feeds* # stop feeds from being crawled
Allow: /*.html # allow all posts and pages to be crawled
# Sitemap of the blog
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-pages.xml
- Disallow: /search* prevents search and label pages from being crawled.
- Disallow: /20* prevents archive sections from being crawled.
- Disallow: /feeds* blocks feed URLs. Use this only if you haven’t created a new Blogger XML sitemap.
- Allow: /*.html ensures that all posts and pages remain accessible to search engines.
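To make the effect of these rules concrete, here is how a few typical Blogger URLs (hypothetical examples on a placeholder domain) would be treated under the file above:
# https://www.example.com/search/label/SEO       -> blocked by Disallow: /search*
# https://www.example.com/search?q=robots        -> blocked by Disallow: /search*
# https://www.example.com/2024/05/               -> blocked by Disallow: /20*
# https://www.example.com/feeds/posts/default    -> blocked by Disallow: /feeds*
# https://www.example.com/2024/05/my-post.html   -> allowed by Allow: /*.html
# https://www.example.com/p/about.html           -> allowed by Allow: /*.html
Post and static page URLs end in .html, and Google resolves the conflict between Disallow: /20* and Allow: /*.html in favour of the longer, more specific Allow rule, which is why posts remain crawlable.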