When a search engine robot is about to crawl your website (e.g. www.example.com), it first checks the “robots.txt” file of your website (e.g. www.example.com/robots.txt). Then it decides which pages to visit and what to index in search results. This tutorial will show you the different rules for creating a “robots.txt” file.
• What Is A “robots.txt” File?
Website owners use the “/robots.txt” file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
The file is placed in the main directory of a website and advises spiders and other robots which directories or files they should not access. The file is purely advisory — not all spiders bother to read it, let alone heed it. However, most, if not all, of the spiders sent by the major search engines to index your site will read it and obey the rules contained within the file.
• How To Create A “robots.txt” File?
The syntax for using the keywords is as follows:
User-agent: [the name of the robot the following rule applies to]
Disallow: [the URL path you want to block]
Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]
A “robots.txt” file is a simple text file, so you can use Notepad or any other text editor to create it. The content of a “robots.txt” file consists of so-called “records”.
Each record in a “robots.txt” file has two fields. The first one gives information about the search engine spider and is addressed by the “User-agent” line, and the second one is the “Disallow” line. “Disallow” may have more than one line. Here is an example:
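```
User-agent: *
Disallow:
```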
The above “robots.txt” record says all search engine spiders can crawl or visit all pages of your website.
“User-agent: *” → “*” denotes “all search engine spiders” and
“Disallow: ” → an empty value means they can crawl or visit all pages
You can also specify a particular search engine spider or robot in the “User-agent” field instead of “*”, and specify which pages it has to ignore while visiting your website. The spider names of different search engines are different. Some of the famous search engine spiders are:
Google — Googlebot
Bing — Bingbot
MSN — MSNbot
Ask — Teoma
Look at the example below:
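```
User-agent: Googlebot
Disallow: /myfiles/
```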
Above “robots.txt” record would allow the “googlebot”, which is the search engine spider of Google, to access every page from your website except the pages from “/myfiles” directory. All files in the “/myfiles” directory will be ignored by googlebot.
eg: Googlebot will not access “www.example.com/myfiles/xxxxxxxxxxxx”
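Here is a record with two “Disallow” lines:

```
User-agent: *
Disallow: /myfiles/
Disallow: /cgi-bin/
```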
The above “robots.txt” record says all search engine spiders are allowed to crawl or visit every page of your website except the pages in the “/myfiles/” directory and the “/cgi-bin/” directory. See here I added two “Disallow” lines. Like that you can add any number of lines.
Sometimes you can also use “Allow” in records. And you can add the XML Sitemap link of your website so that search engine spiders can find new pages quickly.
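For example (assuming your sitemap lives at www.example.com/sitemap.xml):

```
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
```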
The above record says that all search engine spiders are allowed to visit and index every page of your website and can also access your sitemap. This is the best option for most websites.
You can also write the above record using “Disallow”. Both mean the same thing. Like this:
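```
User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml
```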
Next I am going to show you a “robots.txt” file which will hide your entire website from search engines. Like the one below:
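```
User-agent: *
Disallow: /
```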
This record will stop all search engine spiders from accessing your website. So it is advised not to use this “robots.txt” record unless you really want to hide your site.
You can also use “Allow” and “Disallow” together. Like the one below:
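```
User-agent: *
Allow: /
Disallow: /myfiles/
Disallow: /search/
Disallow: /cgi-bin/
```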
The above record would allow all search engine spiders to access the entire website except files in the “/myfiles/”, “/search/” and “/cgi-bin/” directories.
• Keep In Mind While Writing “robots.txt” file
1. Only One Directory/File per Disallow line
Don’t put multiple directories on one “Disallow” line. It will not work.
So put one directory per “Disallow” line. Eg:
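This will not work:

```
Disallow: /myfiles/ /search/ /cgi-bin/
```

Instead, write:

```
Disallow: /myfiles/
Disallow: /search/
Disallow: /cgi-bin/
```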
2. Don’t List Your Secret Directories
Since anyone, not just robots, can read the “robots.txt” records of your website by going to “www.example.com/robots.txt”, it is better not to list your secret directories in “Disallow” lines.
3. You must save your robots.txt code as a plain text file.
4. You must place the file in the highest-level directory of your site (the root of your domain).
5. The file must be named robots.txt.
• Testing Your “robots.txt” File
You can test your robots.txt file to ensure it works as you expect it to – we’d recommend you do this with your robots.txt file even if you think it’s all correct.
The testing tool was created by Google to allow webmasters to check their robots.txt file. To test your robots.txt file, you’ll need to have the site to which it applies registered with Google Webmaster Tools. You then simply select the site from the list and Google will highlight any errors it finds.
Hope you all like this tutorial. Now go and take control of your search results. Use “robots.txt” rules to hide unwanted material on your website from search results.
If you have any doubts please comment below.