Robots.txt: Rules and Syntax for SEO
The World Wide Web is a vast, ever-expanding landscape of information, and search engines are our trusty guides through this digital wild. When you build a website, you want search engines like Google, Bing, or Yahoo to index your content so that users can find it easily in the top results. But sometimes there are parts of your website, or particular URLs, that you don't want indexed or displayed in search results, or you may not want a particular search engine to crawl your site at all. That is where the robots.txt file comes into play.
In this robots.txt guide, we'll explain what robots.txt is, how it works, and best practices for using it to control search engine crawlers.
What is Robots.txt?
Robots.txt, short for the "robots exclusion protocol" file, is a simple text file placed in the root directory of a website's server. Root directory means that if your domain is https://example.com, the file must be accessible at https://example.com/robots.txt. It's used to communicate with web crawlers (also known as spiders or bots) and instruct them on which parts of a website should or should not be crawled and indexed.
The robots.txt file is part of the REP (robots exclusion protocol). The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (like "follow" or "nofollow").
Think of it as a virtual "Entry / No Entry" sign for web crawlers, guiding them on which areas they are allowed to crawl and which content they should ignore. Robots.txt plays a vital role in managing a website's visibility in search engine results.
How Robots.txt Works
When a web crawler visits a website, it first looks for a robots.txt file in the website's root directory (e.g., https://example.com/robots.txt). If the robots.txt file is present on the server, the crawler reads the rules and follows them accordingly.
However, it's important to understand that the robots.txt file operates on a voluntary basis. Most reputable search engines, like Google and Bing, respect the directives and rules specified in the robots.txt file, but some crawlers may ignore these rules entirely.
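To illustrate this flow, here is a minimal Python sketch of what a well-behaved crawler does before fetching a page, using the standard library's robots.txt parser. The crawler name "MyCrawler/1.0" and the example.com URLs are only placeholders for the demonstration.
from urllib import robotparser

# Step 1: locate the robots.txt file in the site's root directory
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")

# Step 2: download and parse the rules
rp.read()

# Step 3: check a URL against the rules before crawling it
if rp.can_fetch("MyCrawler/1.0", "https://example.com/admin/settings"):
    print("Allowed to crawl this URL")
else:
    print("robots.txt disallows this URL, skipping it")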
Basic format of robots.txt:
User-agent: [user-agent name or * for all crawlers]
Disallow: [URL string not to be crawled]
A common use of wildcards is the rule Disallow: /*?, which uses the asterisk (*) to match any URL that contains a query string (indicated by the question mark ?).
For example, it would block URLs like:
https://example.com/page?search=keyword
https://example.com/search?query=term
Disallow a directory by matching a pattern
User-Agent: *
Disallow: /*/*/feed/rss/$
Here *: The asterisk (*) is a wildcard character that matches any sequence of characters in a URL.
/*/*/: This part of the pattern consists of two wildcards separated by forward slashes (/), meaning two path segments of any name are expected before the rest of the path.
feed/rss/: This is the specific URL path you want to disallow. It indicates that the path should end with "feed/rss/".
$: The dollar sign ($) anchors the rule to the end of the URL, so only URLs that end exactly with "feed/rss/" are blocked; the rule has no effect on longer URLs beneath that path.
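Because not every tool supports these wildcards, here is a rough Python sketch (an illustration, not an official parser) of how wildcard-aware crawlers such as Googlebot interpret * and $ when matching the pattern above; the sample URLs are made up for the demonstration.
import re

def rule_matches(rule_path, url_path):
    # Escape the rule, then restore the robots.txt wildcards as regex:
    # * matches any sequence of characters, $ anchors the end of the URL.
    pattern = re.escape(rule_path).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, url_path) is not None

rule = "/*/*/feed/rss/$"
print(rule_matches(rule, "/2023/news/feed/rss/"))        # True: two segments, ends with feed/rss/
print(rule_matches(rule, "/2023/news/feed/rss/extra/"))  # False: $ stops the match at feed/rss/
print(rule_matches(rule, "/feed/rss/"))                  # False: two leading segments are required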
A real example of a robots.txt file from the phppot.com domain
User-Agent: *
Disallow: /admin/
Disallow: /search
Disallow: /?random/
Disallow: /author/
Disallow: /*/comment-page-*
Disallow: /comments/feed
Disallow: /*/feed/
Disallow: /feed/
Disallow: /feed/$
Disallow: /*/feed/$
User-agent: MJ12bot
Disallow: /
User-agent: Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])
Disallow: /
Sitemap: https://phppot.com/sitemap.xml
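The Sitemap line can also be read programmatically. Below is a small sketch using Python's built-in parser with an abbreviated version of the file above (the site_maps() helper requires Python 3.8 or newer).
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Disallow: /admin/",
    "Sitemap: https://phppot.com/sitemap.xml",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)        # parse the rules directly instead of fetching them
print(rp.site_maps())  # ['https://phppot.com/sitemap.xml']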
Allow all crawlers but disallow a particular crawler
User-Agent: *
Disallow: /folder-name/
User-agent: MJ12bot
Disallow: /
User-agent: Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])
Disallow: /
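To see that different crawlers really do get different answers from these rules, here is a quick sketch using Python's standard-library parser with an abbreviated version of the rules above (the Ezooms group is omitted for brevity); the crawler name "Googlebot" and the sample paths are only placeholders.
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Disallow: /folder-name/",
    "User-agent: MJ12bot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/folder-name/page.html"))  # False: blocked for every crawler
print(rp.can_fetch("Googlebot", "/blog/post.html"))         # True: everything else is allowed
print(rp.can_fetch("MJ12bot", "/blog/post.html"))           # False: MJ12bot is blocked site-wide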
Comments in a robots.txt file:
# some comment text
User-agent: *
Disallow: /admin/
Allow all web crawlers to access all files and folders
User-agent: *
Disallow:
Disallow all web crawlers:
User-agent: *
Disallow: /
To disallow a particular file or path
Let's say you have a file named "example-file.php" and you want to disallow it from being indexed by web crawlers (you can list several files or paths, one per Disallow line):
User-agent: *
Disallow: /example-file.php
Disallow: /another-url.html
Disallow: /another-url
To disallow a particular folder, including its subfolders and URLs
Let's say you have a folder named "admin" that contains multiple files, and you want to disallow all of them from being indexed by web crawlers:
User-agent: *
Disallow: /admin/
Disallow a directory URL itself but not its subdirectories and deeper paths
The $ sign limits the rule to the exact URL /feed/, so deeper URLs such as /feed/podcast/ can still be crawled:
User-Agent: *
Disallow: /feed/$
Disallow any URL with a query or search string
User-agent: *
Disallow: /*?