Inside Google Sitemaps: February 2006

Your source for product news and developments

Come by and say hi


This week, some of the Sitemaps team will be at Search Engine Strategies NYC. If you're planning to be there on Monday, come have lunch with us and Matt Cutts. We'll be around throughout the conference and we'd love to talk to you. Please come over and say hi if you see us.

Also, last week, I talked all things Sitemaps with GoodROI on his show GoodKarma. Check out the podcast on Webmaster Radio.

Permalink

Using a robots.txt file


A couple of weeks ago, we launched a robots.txt analysis tool. This tool gives you information about how Googlebot interprets your robots.txt file. You can read more about the Robots Exclusion Standard, but we thought we'd answer some common questions here.


What is a robots.txt file?

A robots.txt file provides restrictions to search engine robots (known as "bots") that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages.


Does my site need a robots.txt file?

Only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).


Where should the robots.txt file be located?

The robots.txt file must reside in the root of the domain. A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not. If you don't have access to the root of a domain, you can restrict access using the Robots META tag.
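
For instance, a standard Robots META tag that asks compliant bots not to index a page looks like this:

<meta name="robots" content="noindex">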


How do I create a robots.txt file?

You can create this file in any text editor. It should be an ASCII-encoded text file, not an HTML file. The filename should be lowercase.


What should the syntax of my robots.txt file be?

The simplest robots.txt file uses two rules:



  • User-Agent: the robot the following rule applies to

  • Disallow: the pages you want to block

These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines in one entry.
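
For instance, a single entry with one User-Agent line and two Disallow lines might look like this (the paths are just placeholders):

User-Agent: *
Disallow: /private_directory/
Disallow: /private_file.html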


User-Agent

A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:


User-Agent: *

Disallow

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).



  • To block the entire site, use a forward slash.
    Disallow: /

  • To block a directory, follow the directory name with a forward slash.
    Disallow: /private_directory/

  • To block a page, list the page.
    Disallow: /private_file.html


URLs are case-sensitive. For instance, Disallow: /private_file.html would block http://www.example.com/private_file.html, but would allow http://www.example.com/Private_File.html.


How do I block Googlebot?

Google uses several user-agents. You can block access to any of them by including the bot name on the User-Agent line of an entry.



  • Googlebot: crawls pages for our web index

  • Googlebot-Mobile: crawls pages for our mobile index

  • Googlebot-Image: crawls pages for our image index

  • Mediapartners-Google: crawls pages to determine AdSense content (used only if you show AdSense ads on your site).
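
For instance, an entry that blocks only the image-crawling bot from a hypothetical /photos/ directory, while leaving the other bots unaffected, would look like this:

User-Agent: Googlebot-Image
Disallow: /photos/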


Can I allow pages?

Yes, Googlebot recognizes an extension to the robots.txt standard called Allow. This extension may not be recognized by all other search engine bots, so check with other search engines you're interested in to find out. The Allow line works exactly like the Disallow line. Simply list a directory or page you want to allow.


You may want to use Disallow and Allow together. For instance, to block access to all pages in a subdirectory except one, you could use the following entries:


User-Agent: Googlebot
Disallow: /folder1/

User-Agent: Googlebot
Allow: /folder1/myfile.html

Those entries would block all pages inside the folder1 directory except for myfile.html.


I don't want certain pages of my site to be indexed, but I want to show AdSense ads on those pages. Can I do that?

Yes, you can Disallow all bots other than Mediapartners-Google from those pages. This keeps the pages from being indexed, but lets the Mediapartners-Google bot analyze the pages to determine the ads to show. The Mediapartners-Google bot doesn't share pages with the other Google user-agents. For instance, you could use the following entries:


User-Agent: *
Disallow: /folder1/

User-Agent: Mediapartners-Google
Allow: /folder1/

I don't want to list every file that I want to block. Can I use pattern matching?

Yes, Googlebot interprets some pattern matching. This is an extension of the standard, so not all bots may follow it.


Matching a sequence of characters

You can use an asterisk (*) to match a sequence of characters. For instance, to block access to all subdirectories that begin with private, you could use the following entry:


User-Agent: Googlebot
Disallow: /private*/
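
Googlebot also understands the $ character, which marks the end of a URL (the robots.txt analysis post below has more on these extensions). For instance, to block all URLs ending in .gif (an arbitrary example), you could use the following entry:

User-Agent: Googlebot
Disallow: /*.gif$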

How can I make sure that my file blocks and allows what I want it to?

You can use our robots.txt analysis tool to:



  • Check specific URLs to see if your robots.txt file allows or blocks them.

  • See if Googlebot had trouble parsing any lines in your robots.txt file.

  • Test changes to your robots.txt file.


Also, if you don't currently use a robots.txt file, you can create one and then test it with the tool before you upload it to your site.


If I change my robots.txt file or upload a new one, how soon will it take effect?

We generally download robots.txt files about once a day. You can see the last time we downloaded your file by accessing the robots.txt tab in your Sitemaps account and checking the Last downloaded date and time.

Permalink

We'd like your feedback on a potential new verification process


Before we show you site stats, we verify site ownership by asking you to upload a uniquely named file. Some webmasters can't do this (for instance, they can't upload files or they can't choose filenames). We are considering adding an alternate verification option and would like your feedback. This option would be an additional choice. It would not replace the current verification process. If you have already verified your site, you won't need to do anything new.

This alternate verification process would require that you place a META tag in the <HEAD> section of your home page. This META tag would contain a text string unique to your Sitemaps account and site. We would provide you with this tag, which would look something like this:

<meta name="owner-v1" contents="unique_string">

We would check for this META tag in the first <HEAD> section of the page, before the first <BODY> section. We would do this so that if your home page is editable (for instance, a wiki-type page or a blog with comments), someone could not add this META tag to the editable section of your page and claim ownership of your site.

The unique string would be a base-64 encoded SHA-256 hash of a string that is composed of the email address of the owner of the site (for instance, admin@google.com) and the domain name of the site (for instance, google.com).
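
As an illustration only, here is a minimal sketch of how such a string could be computed, assuming the email address and domain are simply concatenated and UTF-8 encoded (the post doesn't specify the exact construction, so treat these details as hypothetical):

import base64
import hashlib

# Hypothetical inputs, taken from the examples in the proposal.
owner_email = "admin@google.com"
site_domain = "google.com"

# Assumption: simple concatenation; the actual input format isn't specified.
source = (owner_email + site_domain).encode("utf-8")

# Base-64 encoded SHA-256 hash, as described in the proposal.
unique_string = base64.b64encode(hashlib.sha256(source).digest()).decode("ascii")

# This value would go into the unique_string portion of the META tag shown above.
print(unique_string)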

We'd like your comments on this proposal. For those of you who can't verify now, would this be a method you could use to verify? Do you see any potential problems with this method? Let us know what you think in our Google Group.

Permalink

We've fixed a few things


We've just fixed a few issues that our Google Group members brought to our attention.

Capitalization in robots.txt lines
Our robots.txt analysis tool didn't correctly interpret lines that include capitalized letters. This has been fixed. All results with URLs that include capital letters are being processed correctly. Note that in a robots.txt file, the capitalization of the rules doesn't matter. For instance, Disallow: and disallow: are interpreted in the same way. However, capitalization of URLs does matter. So, for the following robots.txt file:
User-agent: *
Disallow: /Myfile.html
http://www.example.com/Myfile.html is blocked. But http://www.example.com/myfile.html is not blocked.

Google User-agents other than Googlebot
Our robots.txt analysis tool didn't correctly process robots.txt files for Google user-agents other than Googlebot. This has been fixed. The Google user-agents we provide analysis for are:
  • Googlebot-Mobile: crawls pages for our mobile index
  • Googlebot-Image: crawls pages for our image index
  • Googlebot-MediaPartners: crawls pages to determine AdSense content (used only if you show AdSense ads on your site)

robots.txt files with extra initial characters
Some robots.txt files have extra characters before the start of the first rule. Some text editors place these characters in the file, but you can't see them with the editor. When the tool processed these characters, it reported a syntax error. The tool now mimics Googlebot's behavior and ignores these extra characters.

Capitalization in site URLs
When you add a site, we now convert all the letters in the domain portion of the URL to lowercase, regardless of how you entered the letters. For instance, if you enter http://www.Example.com/, we convert that to http://www.example.com/ in your account. This applies only to the domain, so for instance, if you add http://www.Example.com/MySite/, we will convert it to http://www.example.com/MySite/. If you added sites to your account using capitalized letters, you'll notice the domain portions have been converted to lowercase. We made this minor change as part of our efforts to ensure you see all available stats for your site.

Permalink

Improving your site's indexing and ranking


You've submitted a Sitemap for your site. As we explain in our docs, a Sitemap can help us learn about the pages of your site more quickly and comprehensively than our other crawling methods do, but it doesn't guarantee indexing and has no impact on ranking.

What other things can you do to increase your site's indexing and ranking?

Make sure your site is full of unique, high-quality content.
Google's automated crawling, indexing, and ranking processes are focused on providing quality search results. Is your site a high-quality result for the queries you want to rank highly for? Look at your home page. Does it provide information or does it consist primarily of links? If it is mostly links, where do those links go? Do they lead visitors to good information on your site or simply to more links? Look at your site as a searcher would. If you did a search, would you be happy with your site as a result?

Does your site follow the webmaster guidelines?
Take a close look at your site and our webmaster guidelines. Remember that your site should be meant for visitors, not search engines. It's a good idea to read these guidelines and evaluate your site to make sure it meets them. If it doesn't, your site probably won't be indexed, even if you submit a Sitemap. Here are a few things to check.

Does your site use hidden text?
Hidden text is generally not visible to visitors and is meant to give web-crawling robots, such as Googlebot, different content. For instance, a site might add text in a very small font that is the same color as the page's background. Webmasters sometimes do this because they want to provide more information to the web-crawling robots, and this hidden text is often a list of keywords that the webmaster would like the site to be ranked highly for. Don't use hidden text on your site. Since Google's automated processes are focused on giving searchers high-quality results, our guidelines are clear that sites should show Googlebot the same thing they show visitors so our processes can accurately evaluate them.

Does your site use keyword stuffing?
Webmasters sometimes use keyword stuffing much the same way as hidden text. They want to give Googlebot a list of terms that they want their site to rank highly for. But Google's automated processes analyze the contents of a site based on what visitors see, not on a laundry list of keywords. If you want your site to rank highly for particular keywords, make sure your site includes unique, high-quality content related to those keywords.

Does your site buy links from other sites or participate in link exchange programs that don't add value for visitors?
You want other sites to link to you. So, guidelines about links may seem confusing. You want genuine links: another site owner thinks your content is useful and relevant and links to your site. You don't want links that are intended only for Googlebot. For instance, you don't want to pay for a program that spams your link all over the Internet. You don't want to participate in link schemes that require you to link to a bunch of sites you know nothing about in exchange for links on those sites. Do you have hidden links on your site? These are links that visitors can't see and are almost always intended only for search engine web-crawling robots. Think about links in terms of visitors: are the links meant to help them find more good content or are they only meant to attract Googlebot?

Do you use search engine optimization?
If you use a search engine optimization (SEO) company, you should also read through our SEO information to make sure that you aren't using one that unfairly tries to manipulate search engine results.

If your site isn't indexed at all (you can check this by using the site: operator or by logging into your Sitemaps account, accessing the Index stats tab and then clicking the site: link) and you violate these guidelines, you can request reinclusion once you modify your site.

If your site isn't in the index, but doesn't violate these guidelines, there is no need to request reinclusion. Focus on ensuring that your site provides unique, high-quality content for users and submit a Sitemap. Creating a content-rich site is the best way to ensure your site is a high-quality result for search queries you care about and will cause others to link to you naturally.

Permalink

Analyzing a robots.txt file


Earlier this week, we told you about a feature we made available through the Sitemaps program that analyzes the robots.txt file for a site. Here are more details about that feature.

What the analysis means
The Sitemaps robots.txt tool reads the robots.txt file in the same way Googlebot does. If the tool interprets a line as a syntax error, Googlebot doesn't understand that line. If the tool shows that a URL is allowed, Googlebot interprets that URL as allowed.

This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard. It understands Allow: lines, as well as * and $. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.

Subdirectory sites
A robots.txt file is valid only when it's located in the root of a site. So, if you are looking at a site in your account that is located in a subdirectory (such as http://www.example.com/mysite/), we show you information on the robots.txt file at the root (http://www.example.com/robots.txt). You may not have access to this file, but we show it to you because the robots.txt file can impact crawling of your subdirectory site and you may want to make sure it's allowing URLs as you expect.

Testing access to directories
If you test a URL that resolves to a file (such as http://www.example.com/myfile.html), this tool can determine if the robots.txt file allows or blocks that file. If you test a URL that resolves to a directory (such as http://www.example.com/folder1/), this tool can determine if the robots.txt file allows or blocks access to that URL, but it can't tell you about access to the files inside that folder. The robots.txt file may set restrictions on URLs inside the folder that are different from the restriction on the folder itself.

Consider this robots.txt file:
User-Agent: *
Disallow: /folder1/

User-Agent: *
Allow: /folder1/myfile.html
If you test http://www.example.com/folder1/, the tool will say that it's blocked. But if you test http://www.example.com/folder1/myfile.html, you'll see that it's not blocked even though it's located inside of folder1.

Syntax not understood
You might see a "syntax not understood" error for a few different reasons. The most common one is that Googlebot couldn't parse the line. However, some other potential reasons are:
  • The site doesn't have a robots.txt file, but the server returns a status of 200 for pages that aren't found. If the server is configured this way, then when Googlebot requests the robots.txt file, the server returns a page. However, this page isn't actually a robots.txt file, so Googlebot can't process it. (The sketch after this list shows one way to check what your server returns.)
  • The robots.txt file isn't a valid robots.txt file. If Googlebot requests a robots.txt file and receives a different type of file (for instance, an HTML file), this tool won't show a syntax error for every line in the file. Rather, it shows one error for the entire file.
  • The robots.txt file contains a rule that Googlebot doesn't follow. Some user-agents obey rules other than the robots.txt standard. If Googlebot encounters one of the more common additional rules, the tool lists them as syntax errors.
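
One way to check whether your server returns a status of 200 for pages that don't exist (the first case above) is to request your robots.txt file and a made-up URL and compare the status codes. Here's a rough sketch, using a placeholder domain:

import urllib.error
import urllib.request

def status_of(url):
    # Return the HTTP status code the server sends for this URL.
    try:
        with urllib.request.urlopen(url) as response:
            return response.getcode()
    except urllib.error.HTTPError as error:
        return error.code

# A well-behaved server returns 404 for the made-up path below. If it
# returns 200 instead, a request for a missing robots.txt file would get
# back an ordinary page that Googlebot can't process as a robots.txt file.
print(status_of("http://www.example.com/robots.txt"))
print(status_of("http://www.example.com/this-page-should-not-exist"))
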
Known issues
We are working on a few known issues with the tool, including the way the tool processes capitalization and the analysis with Google user-agents other than Googlebot. We'll keep you posted as we get these issues resolved.

Permalink

From the field


Curious what webmasters in the community are saying about our recent release? Check out Matt Cutts' blog where he discusses the new robots.txt tool and answers some of your questions.

Permalink

More stats and analysis of robots.txt files


Today, we released new features for Sitemaps.

robots.txt analysis
If the site has a robots.txt file, the new robots.txt tab provides Googlebot's view of that file, including when Googlebot last accessed it, the status it returns, and if it blocks access to your home page. This tab also lists any syntax errors in the file.


You can enter a list of URLs to see if the robots.txt file allows or blocks them.

You can also test changes to your robots.txt file by entering them here and then testing them against the Googlebot user-agent, other Google user agents, or the Robots Standard. This lets you experiment with changes to see how they would impact the crawl of your site, as well as make sure there are no errors in the file, before making changes to the file on your site.

If you don't have a robots.txt file, you can use this page to test a potential robots.txt file before you add it to your site.

More stats
Crawl stats now include the page on your site that had the highest PageRank, by month, for the last three months.


Page analysis now includes a list of the most common words in your site's content and in external links to your site. This gives you additional information about why your site might come up for particular search queries.

Permalink

A chat with the Sitemaps team


Some members of the Sitemaps team recently took time to answer questions about Sitemaps. We hope we've given some insight into questions you may have wondered about.

Permalink

Giving others access to Sitemaps account information


Once you add a site or Sitemap to your account and verify ownership, we show you statistics and errors about the site. Now someone else on your team also wants to view information about it. Or perhaps you set up a Sitemap for a client site and now they want to control the Sitemap submission and view site information.

Anyone who wants to view site information can simply create a Sitemaps Account, add the site (or Sitemap), and verify ownership. The site isn't penalized in any way if it is added to multiple accounts.

Some questions you may have:

I have Sitemaps for multiple clients in my account. I don't want one client to see everything that's in my account. How do I prevent this?
The client should create their own Sitemaps account and verify site ownership. The client won't see anything that's in your account (even though a site or Sitemap may be listed in both accounts).

I've already verified site ownership and the verification file is still on my server. Does the client also have to verify site ownership?
Yes, each account holder must verify site ownership separately. We ask for a different verification file for each account.

What happens if I submit a Sitemap and then someone else in my company submits that same Sitemap using a different Sitemaps account? Do you see that as duplicate entries and penalize the site?
No, we don't penalize the site in any way. No matter the number of accounts that list the Sitemap, we see it as one Sitemap and process it accordingly.

I want the client to be able to see all the stats for this site right away. I don't want them to have to wait for stats to populate. Can I transfer my account information to the client?
Stats and errors are computed for a site, not for an account. There's no need to transfer your account information because anyone who adds a site and verifies site ownership will see all the information we have available for that site right away.

But I don't want to list the site in my account anymore. Are you sure I can't just transfer the information to the other person's account?
If you don't want the site or Sitemap in your account any longer, you can simply delete it from your account. This won't penalize the site in any way.

Permalink


