Inside Google Sitemaps: September 2005
Your source for product news and developments

All new!

We’ve added some new features to Google Sitemaps.

Date stamps for statistics

For the information we provide once you’ve verified your site, we now let you know when we tried to crawl the URLs we tell you about.

Enhanced support for special characters in URLs

Note that the Sitemap URL must be encoded for readability by the webserver on which it is located. In addition, it can contain only ASCII characters. It can't contain upper ASCII characters, certain control codes, or special characters such as * and {}. If your Sitemap URL contains out-of-range characters, escape them when you submit the URL. Otherwise, you'll receive an error when you try to submit it. You can find more information on escaping out-of-range characters by doing a Google search for [html escape codes]. All URLs must follow the RFC-3986 standard for URIs and the RFC-3987 standard for IRIs.

Documentation updates

We’ve updated the documentation for these new features and added information about the latest version of the Sitemap Generator script and about OAI-PMH submissions (both of which we talked about in earlier blog posts). We’ve also provided some information about errors you might come across when you submit a Sitemap. We’ve made these updates in every language for which we provide documentation.

Resolved issues

And we’ve resolved two issues with this release that you brought to our attention in the Google Group.
If either happens to you, or if you experience any other trouble, please let us know by posting in the Google Group. Several of these features were a direct result of your feedback. Once again, we appreciate your input during our beta period.

We show you more

Just about a month ago, Google Sitemaps added new statistics about problems Google encountered crawling your pages. This stats page showed you up to three URLs we had trouble accessing for each type of error. You asked for more. So we’re giving you more. Now, once you verify your site, we’ll show you up to 10 URLs we’ve had trouble accessing for each type of error, for a maximum total of 60 URLs. Keep posting your suggestions to our Google Group and we'll keep listening. Thanks for your participation during our beta period.

How is a Google Sitemap different from an HTML sitemap?

A Google Sitemap is an XML file that uses the Sitemap protocol. This file lists URLs in your site, along with optional descriptive information about those URLs (such as when they were last updated and how often you modify them). You can create this XML file using our Sitemap Generator or a third-party tool. Google Sitemaps are intended for processing by the Google Sitemaps program.

An HTML sitemap is intended for users of your site. Generally, this type of sitemap provides links to the pages in your site, and may provide descriptions of those pages. We encourage the use of HTML sitemaps. They make it easier for users to navigate your site. Also, as we talk about in our webmaster guidelines, a clear hierarchy with text and links helps us index your site.

You can’t submit an HTML sitemap to the Google Sitemaps program. However, if you are unable to create or generate a Google Sitemap file in the Sitemap protocol format, you can submit a text file that lists URLs in your site.
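To make the XML format described above more concrete, here is a minimal sketch in Python that writes a Sitemap with the optional "last updated" and "how often modified" fields. It is an illustration only, not the Sitemap Generator's code. The function name build_sitemap, the example URLs, and the namespace string (which assumes the 0.84 version of the protocol schema) are assumptions, so check the current protocol documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the 0.84 Sitemap protocol schema.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap(entries):
    """Return Sitemap XML as a string.

    Each entry is a dict with a required "loc" key and optional
    "lastmod" and "changefreq" keys (hypothetical input shape).
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Only emit the optional descriptive fields that were supplied.
        for optional in ("lastmod", "changefreq"):
            if optional in entry:
                ET.SubElement(url, optional).text = entry[optional]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://www.example.com/",
         "lastmod": "2005-09-01", "changefreq": "daily"},
        {"loc": "http://www.example.com/about.html"},
    ]))
```

The text-file alternative mentioned above needs no code at all: it is simply a list of URLs.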
Using OAI-PMH with Google Sitemaps

If your site uses version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), an application-independent interoperability framework based on metadata harvesting, you can use your OAI repository as your Sitemap. Simply submit the baseURL of your OAI repository (for instance, http://www.example.com/oaiserver). When we query the baseURL, we automatically add query parameters (such as ?verb=Identify or ?verb=ListRecords), so you can simply submit the baseURL itself.

When we extract the URLs for your site, we expect the records in the repository to be formatted using Dublin Core, with the URLs embedded in <dc:identifier> tags. Below is a sample record that includes the <dc:identifier> tag in bold. The URL listed in that tag is what we extract.

<oai_dc:dc

As with other Sitemaps, the URLs must be within the same site and at the same directory location or lower than the baseURL. For instance, if you submit http://www.example.com/oaiserver as the baseURL, the following URLs would be valid:

http://www.example.com/

However, if you submit http://www.example.com/dataprovider/oaiserver, then none of those URLs would be valid.
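As a rough illustration of the two rules above (URLs come from <dc:identifier> tags, and they must sit at or below the directory of the baseURL), here is a small Python sketch. It is not how Google processes repositories. The record in SAMPLE_RECORD is a made-up example rather than the sample from the original post, and the helper names extract_identifiers and in_scope are hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Standard Dublin Core element namespace used by oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, for illustration only.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example page</dc:title>
  <dc:identifier>http://www.example.com/docs/page1.html</dc:identifier>
</oai_dc:dc>"""


def extract_identifiers(record_xml):
    """Return the text of every <dc:identifier> element in the record."""
    root = ET.fromstring(record_xml)
    tag = "{%s}identifier" % DC_NS
    return [el.text.strip() for el in root.iter(tag) if el.text]


def in_scope(url, base_url):
    """True if url is on the same site and at or below base_url's directory."""
    base, candidate = urlsplit(base_url), urlsplit(url)
    if (candidate.scheme, candidate.netloc) != (base.scheme, base.netloc):
        return False
    # "/oaiserver" -> "/", "/dataprovider/oaiserver" -> "/dataprovider/"
    base_dir = posixpath.dirname(base.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    return (candidate.path or "/").startswith(base_dir)


if __name__ == "__main__":
    for url in extract_identifiers(SAMPLE_RECORD):
        print(url, in_scope(url, "http://www.example.com/oaiserver"))
```

With http://www.example.com/oaiserver as the baseURL the sample URL passes the check, and with http://www.example.com/dataprovider/oaiserver it does not, which matches the two cases described above.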
Combining Sitemaps into one larger Sitemap

Do you have several small Sitemaps that you would like to combine into one larger one? With version 1.3 of the Sitemap Generator, which we told you about yesterday, you can do just that. This version includes a new input method: To use this input method, locate the The <-- ** MODIFY or DELETE ** This section gives one example. You should replace this example and include an entry for each Sitemap you want to include. Ensure that the <sitemap path="/var/www/docroot/subpath/sitemap*.xml">

The Sitemap Generator extracts all URLs and the optional data listed for each URL for every Sitemap you list and creates one Sitemap with this information. At this time, we can't guarantee that this method will work for Sitemaps created with tools other than the Sitemap Generator.

Announcing Sitemap Generator version 1.3: Improved encoding support

The Sitemap Generator version 1.3 is now available and provides improved encoding support. If your webserver uses an encoding other than UTF-8 or if your domain name or some of the URLs in your site use non-ASCII characters, and you plan to use the Sitemap Generator to create your Sitemap, you should download this latest version.

Generally, non-ASCII URLs should be encoded using UTF-8 before being percent-escaped. However, some webservers respond correctly only if URLs are encoded specifically for the webserver's configuration. All URLs within your Sitemap, as well as the URL of the Sitemap itself, must be encoded for readability by the web server on which they are located.

If you are using the Sitemap Generator, you can specify the encoding of the URLs contained in the Sitemap from within the config.xml file. Within the site definition section of that config file, use the optional default_encoding attribute to specify the encoding used by your webserver.
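The rule described above (encode the URL in the webserver's encoding, UTF-8 by default, and then percent-escape the result) can be sketched in a few lines of Python. This is a hedged illustration written against the Python 3 standard library, not code taken from the Sitemap Generator. The function name escape_url and the example URLs are assumptions, and the sketch assumes an ASCII host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url, encoding="utf-8"):
    """Percent-escape the path and query of url.

    Characters are first encoded to bytes with `encoding` (the
    encoding your webserver expects), then each unsafe byte is
    written as %XX. "%" is treated as safe so that sequences that
    are already escaped are not escaped twice; a literal "%" in the
    input is therefore left untouched.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%", encoding=encoding)
    query = quote(parts.query, safe="=&%", encoding=encoding)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    url = "http://www.example.com/caf\u00e9/page.html"
    print(escape_url(url))             # UTF-8 bytes, then escaped
    print(escape_url(url, "latin-1"))  # server-specific encoding
```

The same path yields different escaped forms depending on the encoding (%C3%A9 for UTF-8, %E9 for Latin-1), which is why the encoding has to match what the webserver expects.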
If you don't use this attribute and your webserver uses an encoding other than UTF-8, the Sitemap Generator can't know which encoding to use, although it does attempt to determine the correct encoding. If the generated Sitemap doesn't list the URLs correctly, you should explicitly indicate the encoding with the default_encoding attribute and run the Sitemap Generator again.

If your URLs contain non-ASCII characters, we recommend that you run the Sitemap Generator script using Python 2.3 or higher. This version of Python has increased non-ASCII support. If your domain name contains non-ASCII characters, you must use Python 2.3 or later, as Internationalizing Domain Names in Applications (IDNA) support wasn't added until this version. Without IDNA support, the Sitemap Generator can't correctly encode a non-ASCII domain name.

Google Sitemaps in your language

We’ve just made our Sitemaps user interface and documentation available in ten additional languages. We have also set up Google Groups for each one. The languages available are:

Brazilian Portuguese
Dutch
French
German
Italian
Korean
Russian
Simplified Chinese
Spanish
Traditional Chinese
UK English
US English

If you already use Google in one of these languages, you should see the change automatically. Otherwise, you can click the Preferences link from the Google home page and choose one of these languages from the interface list. As always, you can submit a Sitemap for sites with content in any language.

Verifying your site: trouble with 404 pages

You want to verify your site so you can view additional statistics. You click the verify link beside the site on the My Sitemaps page, create the file we ask for, upload it to your server, and click the Check Status button. And then you see this error message:

We've detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.

What should you do?
This error means that we've detected that your server returns a status of OK when the requested file is not found. This is the same status that the server returns when the file exists. When we look for the verification file, we can't tell whether your server is returning a status of OK because it found the file or because it couldn't find it. This means we are unable to verify your site.

Modify your web server configuration to return a status of 404 (file not found) in the header of 404 pages. If your site is hosted, ask your hosting company to do this. If your server returns a custom error page when a requested file is not found, make sure that page returns a 404 status in the header. And make sure that the server doesn't redirect requests that return "file not found" to a valid page of your site, such as your home page. This configuration returns a redirect status code (such as 301 or 302) rather than the correct 404 status code. You can read more about HTTP status codes here.

If you don't have a mechanism for checking the headers that your server returns, you can do a search for terms such as [check server header tool] to find online tools that will check this for you.

Once your web server is configured correctly, try to verify your site again and we'll check the configuration.

Copyright © 2005 Google Inc. All rights reserved.