Robots.txt: Rules and Syntax for SEO
The World Wide Web is a vast, ever-expanding landscape of information, and search engines are our trusty guides through this digital wild. When you build a website, you want search engines like Google, Bing, or Yahoo to index your content so that users can find it easily in the top results. But sometimes there are parts of your website, or particular URLs, that you don't want indexed or displayed in search results, or you may not want a particular search engine to crawl your site at all. That is where the robots.txt file comes into play.
In this robots.txt guide, we'll explain what robots.txt is, how it works, and best practices for using it to control search engine crawlers.
What is Robots.txt?
Robots.txt, short for the "robots exclusion protocol" file, is a simple text file placed in the root directory of a website's server. Root directory means that if your domain is https://example.com, the file must be accessible at https://example.com/robots.txt. It's used to communicate with web crawlers (also known as spiders or bots) and instruct them on which parts of a website should or should not be crawled and indexed.
The robots.txt file is part of the REP (robots exclusion protocol). The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (like "follow" or "nofollow").
Think of it as a virtual "Entry / No Entry" sign for web crawlers, guiding them on which areas they are allowed to crawl and which content they should ignore. Robots.txt plays a vital role in managing a website's visibility in search engine results.
How Robots.txt Works
When a web crawler visits a website, it first looks for a robots.txt file in the website's root directory (e.g., https://example.com/robots.txt). If the robots.txt file is present on the server, the crawler reads the rules and follows them accordingly.
However, it's important to understand that the robots.txt file operates on a voluntary basis. Most reputable search engines, like Google and Bing, respect the directives and rules specified in the robots.txt file, but some crawlers may ignore these rules entirely.
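To illustrate this flow, here is a minimal Python sketch of what a well-behaved crawler does before fetching a page, using the standard library's robots.txt parser. The crawler name "MyCrawler/1.0" and the example.com URLs are only placeholders for the demonstration.
from urllib import robotparser

# Step 1: locate the robots.txt file in the site's root directory
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")

# Step 2: download and parse the rules
rp.read()

# Step 3: check a URL against the rules before crawling it
if rp.can_fetch("MyCrawler/1.0", "https://example.com/admin/settings"):
    print("Allowed to crawl this URL")
else:
    print("robots.txt disallows this URL, skipping it")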
Basic format of robots.txt:
User-agent: [user-agent name or * for all crawlers]
Disallow: [URL string not to be crawled]
A common use of wildcards is the rule Disallow: /*?, which uses the asterisk (*) to match any URL that contains a query string (indicated by the question mark ?).
For example, it would block URLs like:
https://example.com/page?search=keyword
https://example.com/search?query=term
Disallow a directory by matching a pattern
User-Agent: *
Disallow: /*/*/feed/rss/$
Here *: The asterisk (*) is a wildcard character that matches any sequence of characters in a URL.
/*/*/: This part of the pattern consists of two wildcards separated by forward slashes (/), meaning two path segments of any name are expected before the rest of the path.
feed/rss/: This is the specific URL path you want to disallow. It indicates that the path should end with "feed/rss/".
$: The dollar sign ($) anchors the rule to the end of the URL, so only URLs that end exactly with "feed/rss/" are blocked; the rule has no effect on longer URLs beneath that path.
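Because not every tool supports these wildcards, here is a rough Python sketch (an illustration, not an official parser) of how wildcard-aware crawlers such as Googlebot interpret * and $ when matching the pattern above; the sample URLs are made up for the demonstration.
import re

def rule_matches(rule_path, url_path):
    # Escape the rule, then restore the robots.txt wildcards as regex:
    # * matches any sequence of characters, $ anchors the end of the URL.
    pattern = re.escape(rule_path).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, url_path) is not None

rule = "/*/*/feed/rss/$"
print(rule_matches(rule, "/2023/news/feed/rss/"))        # True: two segments, ends with feed/rss/
print(rule_matches(rule, "/2023/news/feed/rss/extra/"))  # False: $ stops the match at feed/rss/
print(rule_matches(rule, "/feed/rss/"))                  # False: two leading segments are required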
A real example of a robots.txt file from the phppot.com domain
User-Agent: *
Disallow: /admin/
Disallow: /search
Disallow: /?random/
Disallow: /author/
Disallow: /*/comment-page-*
Disallow: /comments/feed
Disallow: /*/feed/
Disallow: /feed/
Disallow: /feed/$
Disallow: /*/feed/$
User-agent: MJ12bot
Disallow: /
User-agent: Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])
Disallow: /
Sitemap: https://phppot.com/sitemap.xml
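The Sitemap line can also be read programmatically. Below is a small sketch using Python's built-in parser with an abbreviated version of the file above (the site_maps() helper requires Python 3.8 or newer).
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Disallow: /admin/",
    "Sitemap: https://phppot.com/sitemap.xml",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)        # parse the rules directly instead of fetching them
print(rp.site_maps())  # ['https://phppot.com/sitemap.xml']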
Allow all crawlers but disallow a particular crawler
User-Agent: *
Disallow: /folder-name/
User-agent: MJ12bot
Disallow: /
User-agent: Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])
Disallow: /
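To see that different crawlers really do get different answers from these rules, here is a quick sketch using Python's standard-library parser with an abbreviated version of the rules above (the Ezooms group is omitted for brevity); the crawler name "Googlebot" and the sample paths are only placeholders.
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Disallow: /folder-name/",
    "User-agent: MJ12bot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/folder-name/page.html"))  # False: blocked for every crawler
print(rp.can_fetch("Googlebot", "/blog/post.html"))         # True: everything else is allowed
print(rp.can_fetch("MJ12bot", "/blog/post.html"))           # False: MJ12bot is blocked site-wide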
Comments in a robots.txt file:
# some comment text
User-agent: *
Disallow: /admin/
Allow all web crawlers to access all files and folders
User-agent: *
Disallow:
Disallow all web crawlers:
User-agent: *
Disallow: /
To disallow a particular file or path
Let's say you have a file named "example-file.php" and you want to disallow it from being indexed by web crawlers (you can list several files or paths, one per Disallow line):
User-agent: *
Disallow: /example-file.php
Disallow: /another-url.html
Disallow: /another-url
To disallow a particular folder, including its subfolders and URLs
Let's say you have a folder named "admin" that contains multiple files, and you want to disallow all of them from being indexed by web crawlers:
User-agent: *
Disallow: /admin/
Disallow a directory URL itself but not its subdirectories and deeper paths
The $ sign limits the rule to the exact URL /feed/, so deeper URLs such as /feed/podcast/ can still be crawled:
User-Agent: *
Disallow: /feed/$
Disallow any URL with a query or search string
User-agent: *
Disallow: /*?