People often ask: what is robots.txt? It is simply a text file. It provides directives, or instructions, to search engine crawlers about which parts of your site may be crawled and which may not. By allowing search engines to crawl the content on your site, you also allow them to index that content, and indexing is what makes your content visible in search results.
If you are interested in more options for crawling a page without indexing the content (perhaps you want the crawler to follow the links on a page to reach other content you do want indexed), then this meta robots article is a good resource.
How Robots.txt Works
The first thing a search engine crawler does when it lands on a website is look for a robots.txt file. The crawler reads the directives in this file before proceeding to crawl the site.
Robots.txt is not a website requirement. If you have no specific criteria for user-agents, you likely don't need one. If the file doesn't exist, crawlers assume that all pages and content on the site may be crawled and indexed by search engines.
Now that you know what robots.txt is and what it does, there are a few more things to understand before you move forward with creating or editing your own robots.txt file.
Where is Robots.txt Located?
The file must be placed in the site’s root directory for crawlers to find it.
The easiest way to determine if you have a properly installed robots.txt file on your website is to type http://your-domain.com/robots.txt in your browser’s address bar.
If it exists, you will see content that resembles something like this:
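A hypothetical example of what such a file might contain (the blocked path and sitemap location here are illustrative, not requirements):

```
User-agent: *
Disallow: /wp-admin/

Sitemap: http://your-domain.com/sitemap.xml
```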
Otherwise, if no robots.txt file exists, your browser will return a "page not found" (404) error.
How to Interpret Robots.txt
A single robots.txt file may contain one or many sets of directives. Multiple sets must be separated by a single blank line, and there can be no blank lines within a set.
A set begins with a user-agent and then is followed by one or more directives.
Each directive applies only to the user-agent identified in the set. Here are the four primary directive options:
- Disallow (blocks crawling of the specified path)
- Allow (permits a more specific path within a disallowed one; supported by major crawlers such as Googlebot and Bingbot)
- Crawl-delay (slows crawling; honored by some crawlers, such as Bingbot, but ignored by Googlebot)
- Sitemap (points crawlers to your XML sitemap)
There may be times when rules apply to more than one user-agent. In this case, the crawler honors the most specific set of instructions that matches its user-agent and ignores the others.
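As a sketch, consider a hypothetical file with two sets, one generic and one for Googlebot (both paths are made up for illustration):

```
User-agent: *
Disallow: /archive/

User-agent: Googlebot
Disallow: /drafts/
```

Because the second set names Googlebot specifically, Googlebot follows only that set: it may still crawl /archive/ but will skip /drafts/, while all other crawlers follow the generic set.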
Common search engine user-agents include:
- Googlebot (Google's user-agent)
- Bingbot (Bing's user-agent)
- Slurp (Yahoo's user-agent)
- Facebot (Facebook's user-agent)
It is worth noting that there are many types of user-agents out there, but the only ones that concern robots.txt are search engine crawlers. Remember, robots.txt is where we instruct search engine crawlers how to proceed with crawling and indexing our content.
Even though you might give specific directives in robots.txt, it is still up to individual crawlers to honor them. Technically, a user-agent can choose not to adhere to the directives you define (though well-behaved crawlers comply; the ones that don't tend to be malware bots and the like).
The Asterisk (*) in Robots.txt
This asterisk is very significant. It is a wildcard character that matches anything, which makes a rule all-encompassing.
- When you see User-agent: * it literally means all search engine crawlers.
- When you see Disallow: /*private/ it means block any URL whose path contains the character sequence "private/" anywhere, such as a directory whose name ends with "private."
How to Create a Robots.txt File
You need a text editor. Popular options include Notepad, TextPad, Brackets, and Atom, and there are many others available, many as free downloads.
- Create a new file.
- Write the crawl directives for your pages:
  - Each set must address only one user-agent or use the asterisk.
  - Each set may contain multiple directives.
  - Sets must be separated by a single blank line.
- Save it as a plain text file named: robots.txt
- Upload the file to the root directory of your website. When you type http://your-domain.com/robots.txt in the browser (using your real domain, of course), the robots.txt content will appear.
In addition to writing content crawl directives, robots.txt is a very effective method for telling search engines where your sitemap(s) are located.
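A sketch of such a directive, assuming the sitemap file is named sitemap.xml and sits in the site root:

```
Sitemap: http://your-domain.com/sitemap.xml
```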
In this example, the directive is telling all search engine crawlers that the sitemap to follow is located in the website root directory in a file called sitemap.xml.
To allow search engine crawlers to crawl everything on your site:
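One common way to write this is an empty Disallow, which blocks nothing:

```
User-agent: *
Disallow:
```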
To disallow, or forbid, search engine crawlers from crawling everything on your site:
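A single slash after Disallow blocks the entire site for every crawler:

```
User-agent: *
Disallow: /
```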
If you want to stop Google Image Search from crawling and indexing the photos on your site:
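Google's image crawler identifies itself as Googlebot-Image, so a set like this blocks it from the whole site:

```
User-agent: Googlebot-Image
Disallow: /
```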
If you want to block single web pages from being crawled and indexed:
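List each page's path on its own Disallow line; the paths below are hypothetical examples:

```
User-agent: *
Disallow: /thank-you.html
Disallow: /checkout.html
```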
If you want to disallow portions of a server from robots:
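Each directory gets its own Disallow line; the directories here are hypothetical examples:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
```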
If you want to disallow specific file types from being crawled:
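Using the wildcard together with the $ end-of-URL anchor (supported by major crawlers such as Googlebot and Bingbot), a sketch that blocks PDF files might look like this:

```
User-agent: *
Disallow: /*.pdf$
```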
If you want to disallow any filename that contains a particular character sequence:
Contains private within the filename:
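Because the wildcard matches any characters, this sketch blocks any URL whose path contains "private" anywhere:

```
User-agent: *
Disallow: /*private
```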
A filename that begins with private:
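Robots.txt matching is prefix-based, so this blocks any path that begins with "private":

```
User-agent: *
Disallow: /private
```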
A filename that ends with private:
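Using the $ anchor (supported by major crawlers), this sketch blocks any URL that ends with "private":

```
User-agent: *
Disallow: /*private$
```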
For specific questions regarding search directives, please feel free to reach out to us via the comments section below.
To learn more about our web services, check out our Web Design page. If you are interested in our other digital services or would like a quote, we would love to discuss your project with you. Call us at 904-330-0904.