Inside Google Sitemaps: February 2006
Your source for product news and developments

Come by and say hi

This week, some of the Sitemaps team will be at Search Engine Strategies NYC. If you're planning to be there on Monday, come have lunch with us and Matt Cutts. We'll be around throughout the conference and we'd love to talk to you. Please come over and say hi if you see us. Also, last week, I talked all things Sitemaps with GoodROI on his show GoodKarma. Check out the podcast on Webmaster Radio.

Using a robots.txt file

A couple of weeks ago, we launched a robots.txt analysis tool. This tool gives you information about how Googlebot interprets your robots.txt file. You can read more about the Robots Exclusion Standard, but we thought we'd answer some common questions here:

What is a robots.txt file?
Does my site need a robots.txt file?
Where should the robots.txt file be located?
How do I create a robots.txt file?
What should the syntax of my robots.txt file be?
The simplest robots.txt file uses two rules: a User-Agent line that names the robot an entry applies to, and a Disallow line that lists the URLs you want to block. These two lines are considered a single entry in the file. You can include as many entries as you want, and you can include multiple Disallow lines in one entry. A User-Agent value of * applies the entry to all robots.
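As a rough illustration of an entry like the one described above, Python's standard urllib.robotparser can parse a small robots.txt file and report whether a URL is blocked. This is only a sketch; the /private/ path is a made-up example, not from the post.

```python
import urllib.robotparser

# A hypothetical two-line entry: one User-Agent line, one Disallow line.
robots_txt = """\
User-Agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# URLs under /private/ are blocked for all robots; everything else is allowed.
print(rp.can_fetch("Googlebot", "http://www.example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/public/page.html"))   # True
```

Note that urllib.robotparser implements the plain Robots Exclusion Standard, not Googlebot's extensions, so it is only a rough stand-in for the analysis tool.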
URLs are case-sensitive. For instance, Disallow: /private_file.html would block http://www.example.com/private_file.html, but would allow http://www.example.com/Private_File.html.

How do I block Googlebot?
To block Googlebot from your entire site, add an entry with User-Agent: Googlebot and Disallow: /.
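The case-sensitivity point above can be checked with urllib.robotparser, which also matches rule paths case-sensitively. This is a sketch using the post's example file:

```python
import urllib.robotparser

robots_txt = """\
User-Agent: *
Disallow: /private_file.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The lowercase URL matches the rule exactly; the capitalized one does not.
print(rp.can_fetch("*", "http://www.example.com/private_file.html"))  # False (blocked)
print(rp.can_fetch("*", "http://www.example.com/Private_File.html"))  # True (allowed)
```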
Can I allow pages?
You may want to use Disallow and Allow together. For instance, to block access to all pages in a subdirectory except one, you could use the following entries:

User-Agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Those entries would block all pages inside the folder1 directory except for myfile.html.

I don't want certain pages of my site to be indexed, but I want to show AdSense ads on those pages. Can I do that?
Yes. The AdSense crawler uses its own user-agent, Mediapartners-Google. You can block those pages in an entry that begins with User-Agent: * and add a separate entry that allows Mediapartners-Google to access them.

I don't want to list every file that I want to block. Can I use pattern matching?
Googlebot (but not all robots) interprets some pattern matching. To match a sequence of characters, use an asterisk (*) in an entry for User-Agent: Googlebot.

How can I make sure that my file blocks and allows what I want it to?
You can use our robots.txt analysis tool to test your file against specific URLs.
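An Allow/Disallow combination like the one described above can also be checked with urllib.robotparser, with one caveat: Python's parser applies the first matching rule in an entry rather than Googlebot's own conflict resolution, so this sketch lists the Allow line first.

```python
import urllib.robotparser

# Note: urllib.robotparser applies the first matching rule in an entry,
# so the narrower Allow line is listed before the broader Disallow line.
# Googlebot itself resolves Allow/Disallow conflicts differently.
robots_txt = """\
User-Agent: Googlebot
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "http://www.example.com/folder1/myfile.html"))  # True
print(rp.can_fetch("Googlebot", "http://www.example.com/folder1/other.html"))   # False
```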
Also, if you don't currently use a robots.txt file, you can create one and then test it with the tool before you upload it to your site.

If I change my robots.txt file or upload a new one, how soon will it take effect?

We'd like your feedback on a potential new verification process

Before we show you site stats, we verify site ownership by asking you to upload a uniquely named file. Some webmasters can't do this (for instance, they can't upload files or they can't choose filenames). We are considering adding an alternate verification option and would like your feedback. This option would be an additional choice; it would not replace the current verification process. If you have already verified your site, you won't need to do anything new.

This alternate verification process would require that you place a META tag in the <HEAD> section of your home page. This META tag would contain a text string unique to your Sitemaps account and site. We would provide you with this tag, which would look something like this:

<meta name="owner-v1" contents="unique_string">

We would check for this META tag in the first <HEAD> section of the page, before the first <BODY> section. We would do this so that if your home page is editable (for instance, a wiki-type page or a blog with comments), someone could not add this META tag to the editable section of your page and claim ownership of your site. The unique string would be a base-64 encoded SHA-256 hash of a string composed of the email address of the owner of the site (for instance, admin@google.com) and the domain name of the site (for instance, google.com).

We'd like your comments on this proposal. For those of you who can't verify now, would this be a method you could use to verify? Do you see any potential problems with this method? Let us know what you think in our Google Group.

We've fixed a few things

We've just fixed a few issues that our Google Group members brought to our attention.
Capitalization in robots.txt lines
Our robots.txt analysis tool didn't correctly interpret lines that include capitalized letters. This has been fixed, and all results with URLs that include capital letters are now processed correctly. Note that in a robots.txt file, the capitalization of the rules doesn't matter. For instance, Disallow: and disallow: are interpreted in the same way. However, capitalization of URLs does matter. So, for the following robots.txt file:

User-agent: *
Disallow: /Myfile.html

http://www.example.com/Myfile.html is blocked, but http://www.example.com/myfile.html is not.

Google user-agents other than Googlebot
Our robots.txt analysis tool didn't correctly process robots.txt files for Google user-agents other than Googlebot. This has been fixed. The Google user-agents we provide analysis for are:
Extra characters in robots.txt files
Some robots.txt files have extra characters before the start of the first rule. Some text editors place these characters in the file, but you can't see them with the editor. When the tool processed these characters, it reported a syntax error. The tool now mimics Googlebot's behavior and ignores these extra characters.

Capitalization in site URLs
When you add a site, we now convert all the letters in the domain portion of the URL to lowercase, regardless of how you entered them. For instance, if you enter http://www.Example.com/, we convert that to http://www.example.com/ in your account. This applies only to the domain, so if you add http://www.Example.com/MySite/, we convert it to http://www.example.com/MySite/. If you added sites to your account using capitalized letters, you'll notice the domain portions have been converted to lowercase. We made this minor change as part of our efforts to ensure you see all available stats for your site.

Improving your site's indexing and ranking

You've submitted a Sitemap for your site. As we explain in our docs, a Sitemap can help us learn about the pages of your site more quickly and comprehensively than our other methods of crawling, but it doesn't guarantee indexing and has no impact on ranking. What other things can you do to increase your site's indexing and ranking?

Make sure your site is full of unique, high-quality content. Google's automated crawling, indexing, and ranking processes are focused on providing quality search results. Is your site a high-quality result for the queries you want to rank highly for? Look at your home page. Does it provide information, or does it consist primarily of links? If it is mostly links, where do those links go? Do they lead visitors to good information on your site or simply to more links? Look at your site as a searcher would. If you did a search, would you be happy with your site as a result?

Does your site follow the webmaster guidelines?
Take a close look at your site and our webmaster guidelines. Remember that your site should be meant for visitors, not search engines. It's a good idea to read these guidelines and evaluate your site to make sure it meets them. If it doesn't, your site probably won't be indexed, even if you submit a Sitemap. Here are a few things to check.

Does your site use hidden text?
Hidden text is generally not visible to visitors and is meant to give web-crawling robots, such as Googlebot, different content. For instance, a site might add text in a very small font that is the same color as the page's background. Webmasters sometimes do this because they want to provide more information to the web-crawling robots, and this hidden text is often a list of keywords that the webmaster would like the site to rank highly for. Don't use hidden text on your site. Since Google's automated processes are focused on giving searchers high-quality results, our guidelines are clear that sites should show Googlebot the same thing they show visitors so our processes can accurately evaluate them.

Does your site use keyword stuffing?
Webmasters sometimes use keyword stuffing in much the same way as hidden text. They want to give Googlebot a list of terms that they want their site to rank highly for. But Google's automated processes analyze the contents of a site based on what visitors see, not on a laundry list of keywords. If you want your site to rank highly for particular keywords, make sure your site includes unique, high-quality content related to those keywords.

Does your site buy links from other sites or participate in link exchange programs that don't add value for visitors?
You want other sites to link to you, so guidelines about links may seem confusing. You want genuine links: another site owner thinks your content is useful and relevant and links to your site. You don't want links that are intended only for Googlebot.
For instance, you don't want to pay for a program that spams your link all over the Internet, and you don't want to participate in link schemes that require you to link to a bunch of sites you know nothing about in exchange for links on those sites.

Do you have hidden links on your site?
These are links that visitors can't see and are almost always intended only for search engine web-crawling robots. Think about links in terms of visitors: are the links meant to help them find more good content, or are they only meant to attract Googlebot?

Do you use search engine optimization?
If you use a search engine optimization (SEO) company, you should also read through our SEO information to make sure that you aren't using one that is unfairly trying to manipulate search engine results.

If your site isn't indexed at all (you can check this by using the site: operator, or by logging into your Sitemaps account, accessing the Index stats tab, and then clicking the site: link) and you violate these guidelines, you can request reinclusion once you modify your site. If your site isn't in the index but doesn't violate these guidelines, there is no need to request reinclusion. Focus on ensuring that your site provides unique, high-quality content for users, and submit a Sitemap. Creating a content-rich site is the best way to ensure your site is a high-quality result for the search queries you care about, and it will lead others to link to you naturally.

Analyzing a robots.txt file

Earlier this week, we told you about a feature we made available through the Sitemaps program that analyzes the robots.txt file for a site. Here are more details about that feature.

What the analysis means
The Sitemaps robots.txt tool reads the robots.txt file in the same way Googlebot does. If the tool interprets a line as a syntax error, Googlebot doesn't understand that line. If the tool shows that a URL is allowed, Googlebot interprets that URL as allowed.
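Googlebot's matching, which the tool mirrors, extends plain prefix matching with a * wildcard and a $ end-of-URL anchor. The sketch below is an illustrative approximation of that matching under those assumptions, not Googlebot's actual implementation.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Approximate Googlebot-style rule matching: plain prefix match,
    with * matching any sequence of characters and a trailing $
    anchoring the pattern to the end of the URL path. Illustrative
    sketch only, not Googlebot's actual matcher."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal characters, turn each * into a regex ".*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/folder1/$", "/folder1/"))             # True: exact match
print(rule_matches("/folder1/$", "/folder1/myfile.html"))  # False: $ anchors the rule
print(rule_matches("/*.gif$", "/images/photo.gif"))        # True: * spans the middle
```

A rule with a trailing $ can therefore block a directory URL itself without blocking the files inside it, which is the directory-testing subtlety discussed below.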
This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard: it understands Allow: lines, as well as * and $. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.

Subdirectory sites
A robots.txt file is valid only when it's located in the root of a site. So, if you are looking at a site in your account that is located in a subdirectory (such as http://www.example.com/mysite/), we show you information on the robots.txt file at the root (http://www.example.com/robots.txt). You may not have access to this file, but we show it to you because the robots.txt file can impact crawling of your subdirectory site, and you may want to make sure it's allowing URLs as you expect.

Testing access to directories
If you test a URL that resolves to a file (such as http://www.example.com/myfile.html), this tool can determine if the robots.txt file allows or blocks that file. If you test a URL that resolves to a directory (such as http://www.example.com/folder1/), this tool can determine if the robots.txt file allows or blocks access to that URL, but it can't tell you about access to the files inside that folder. The robots.txt file may set restrictions on URLs inside the folder that are different from those on the URL of the folder itself. Consider this robots.txt file:

User-Agent: *
Disallow: /folder1/$

If you test http://www.example.com/folder1/, the tool will say that it's blocked. But if you test http://www.example.com/folder1/myfile.html, you'll see that it's not blocked even though it's located inside folder1.

Syntax not understood
You might see a "syntax not understood" error for a few different reasons. The most common one is that Googlebot couldn't parse the line. However, some other potential reasons are:
We are working on a few known issues with the tool, including the way the tool processes capitalization and the analysis for Google user-agents other than Googlebot. We'll keep you posted as we get these issues resolved.

From the field

Curious what webmasters in the community are saying about our recent release? Check out Matt Cutts' blog, where he discusses the new robots.txt tool and answers some of your questions.

More stats and analysis of robots.txt files

Today, we released new features for Sitemaps.

robots.txt analysis
If the site has a robots.txt file, the new robots.txt tab provides Googlebot's view of that file, including when Googlebot last accessed it, the status it returns, and whether it blocks access to your home page. This tab also lists any syntax errors in the file.

You can enter a list of URLs to see if the robots.txt file allows or blocks them.

You can also test changes to your robots.txt file by entering them here and then testing them against the Googlebot user-agent, other Google user-agents, or the Robots Standard. This lets you experiment with changes to see how they would impact the crawl of your site, and make sure there are no errors in the file, before changing the file on your site. If you don't have a robots.txt file, you can use this page to test a potential robots.txt file before you add it to your site.

More stats
Crawl stats now include the page on your site that had the highest PageRank, by month, for the last three months.

Page analysis now includes a list of the most common words in your site's content and in external links to your site. This gives you additional information about why your site might come up for particular search queries.

A chat with the Sitemaps team

Some members of the Sitemaps team recently took some time to answer some questions about Sitemaps. We hope we've given some insight on questions you may have wondered about.
Giving others access to Sitemaps account information

Once you add a site or Sitemap to your account and verify ownership, we show you statistics and errors about the site. Now someone else on your team also wants to view information about it. Or perhaps you set up a Sitemap for a client site and now they want to control the Sitemap submission and view site information. Anyone who wants to view site information can simply create a Sitemaps account, add the site (or Sitemap), and verify ownership. The site isn't penalized in any way if it is added to multiple accounts. Some questions you may have:

I have Sitemaps for multiple clients in my account. I don't want one client to see everything that's in my account. How do I prevent this?
The client should create their own Sitemaps account and verify site ownership. The client won't see anything that's in your account (even though a site or Sitemap may be listed in both accounts).

I've already verified site ownership and the verification file is still on my server. Does the client also have to verify site ownership?
Yes, each account holder must verify site ownership separately. We ask for a different verification file for each account.

What happens if I submit a Sitemap and then someone else in my company submits that same Sitemap using a different Sitemaps account? Do you see that as duplicate entries and penalize the site?
No, we don't penalize the site in any way. No matter the number of accounts that list the Sitemap, we see it as one Sitemap and process it accordingly.

I want the client to be able to see all the stats for this site right away. I don't want them to have to wait for stats to populate. Can I transfer my account information to the client?
Stats and errors are computed for a site, not for an account. There's no need to transfer your account information, because anyone who adds a site and verifies site ownership will see all the information we have available for that site right away.
But I don't want to list the site in my account anymore. Are you sure I can't just transfer the information to the other person's account?
If you don't want the site or Sitemap in your account any longer, you can simply delete it from your account. This won't penalize the site in any way.

Copyright © 2005 Google Inc. All rights reserved.