Free Search Engine Optimization Tool for Yahoo! Sitemap Feed Generation and Notification

Yahoo!'s recent enhancements to its Site Explorer program let a webmaster automatically notify Yahoo! of changes to a site via an RSS XML feed. Webmasters can also check through the Site Explorer interface when Yahoo! last downloaded the sitemap file.
While the functionality is still underwhelming compared to Google's Webmaster Central, formerly Google Sitemaps, many search engine optimization practitioners will want to start experimenting with Yahoo! Site Explorer feeds in anticipation of bigger things to come.
Update: on 2006-11-16 Yahoo! and Microsoft announced joint collaboration and support for the Sitemaps standard previously introduced by Google.
Yet how can one generate a valid RSS feed for Yahoo! utilizing the same filter rules set up for Google? Facing this quandary, we decided that the easiest approach was to take advantage of the fact that Google released its sitemap generator script as open source code.
We have added extensions to the original Google sitemap generator so that an RSS 2.0-compliant sitemap.rss.gz file is created in addition to the Google-specific sitemap.xml.gz whenever the Google sitemap generator runs. While Yahoo! also accepts the Atom 0.3 format, we decided upon RSS 2.0 because Atom 0.3 was never codified as a standard; it does not appear that Yahoo! accepts Atom 1.0 yet.
We are releasing an alpha version of our extensions in order to gather feedback from as wide an audience as possible. Download the program sitemap_gen_plus_rss.py.gz (22 kb). If you like it, please consider linking to us as a thank you! Please do not link directly to these files; their names will change.
Keep in mind:
Updated 2006-09-07 to properly convert html character entities if present in the original HTML document. This is required by the XML standard.
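As a rough illustration of why this conversion matters: XML predefines only five named entities (amp, lt, gt, apos, quot), so an HTML entity such as &eacute; copied verbatim into the feed makes it invalid. The sketch below is not the code used in our extension; it assumes a modern Python 3 interpreter, and the function name html_text_to_xml is purely illustrative.

```python
import html
from xml.sax.saxutils import escape


def html_text_to_xml(text, ascii_only=False):
    """Convert text containing HTML entities into XML-safe text (illustrative helper)."""
    # Decode HTML entities such as &eacute; or &copy; into real characters.
    decoded = html.unescape(text)
    # Re-escape only the characters XML reserves in text nodes: &, < and >.
    xml_safe = escape(decoded)
    if ascii_only:
        # Optionally write non-ASCII characters as numeric character references,
        # which every XML parser understands without extra declarations.
        xml_safe = xml_safe.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return xml_safe
```

Calling html_text_to_xml("Caf&eacute; &amp; Bar", ascii_only=True) yields "Caf&#233; &amp; Bar", which is safe to place inside a title or description element.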
We have aimed to implement RSS 2.0 output generation with minimal impact on the existing sitemap_gen.py file. Where possible, we have kept extensions in separate code blocks to make it easier to apply them to future Google releases. We have not worried about performance. As we are not versed in Python, our code is surely not going to win any elegance awards; contributions are welcome!
Our extension produces an RSS 2.0-compliant file in addition to the standard Google format, provided you are using the file system crawling mode of sitemap_gen.py and have fewer than 50,000 files. For each HTML file (based on the recognized MIME type text/html), we try to get the document title from the title tag and the document description from the meta description tag. We have verified that commented-out text is ignored and that tags are matched case-insensitively. Non-English characters are output using HTML entities. Automatic Yahoo! notification has been implemented; to use it, insert your application id by replacing the text <insert appid here>, including the less-than and greater-than angle brackets. This functionality has not been extensively tested, so proceed with caution. If you are already using Google notification and you do not insert your Yahoo! application id, you will probably see an error!
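The title and description lookup described above can be sketched as follows. This is a minimal illustration, not the code shipped in our extension: it assumes Python 3 and uses the standard library html.parser module rather than elementtree, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleDescriptionParser(HTMLParser):
    """Collects the <title> text and the meta description of one HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <TITLE> and <META>
        # are matched here as well. Commented-out markup never reaches this
        # method; it is delivered to handle_comment instead.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title_description(html_text):
    """Return (title, description); either may be an empty string."""
    parser = TitleDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.description
```

For a document whose head contains <TITLE>Cafe Menu</TITLE> and <META NAME="Description" CONTENT="Daily specials">, the function returns ("Cafe Menu", "Daily specials"), regardless of the case used for tag and attribute names.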
The RSS format requires a document title or description in addition to a link. The RSS format does not contain the priority field specified by Google to assist crawler prioritization.
As document title and description are not part of the Google specification, we need to open each html document which may contain this information, and if it exists, add it to the document item attributes.
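To make the shape of the output concrete, here is a hedged sketch of how a list of pages could be turned into an RSS 2.0 document in which each item carries a link, a title and, when available, a description. It assumes Python 3 and the standard library xml.etree.ElementTree module; the function name build_rss and the dictionary keys are illustrative and do not appear in our extension.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, pages):
    """Build an RSS 2.0 document; pages is a list of dicts with a 'link' key
    and optional 'title' and 'description' keys (illustrative structure)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three channel elements.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "link").text = page["link"]
        # An item needs a title or a description; fall back to an empty title.
        ET.SubElement(item, "title").text = page.get("title", "")
        if page.get("description"):
            ET.SubElement(item, "description").text = page["description"]
    # ElementTree escapes &, < and > in text nodes during serialization.
    return ET.tostring(rss, encoding="unicode")
```

Note that there is no priority element anywhere in the result: the Google priority value is simply not carried over into the RSS output.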
Title and description lookup is currently implemented for the crawl mode of URL discovery. It is not yet implemented for the URL list method, but we anticipate this to be rather easy to do and expect to accomplish it in the very near future. To minimize development requirements, we added lookup logic to a point in the existing program where file path locations are processed but before any drop processing, as specified by the configuration file, has occurred. Thus there is extra overhead as dropped file titles and descriptions are retrieved. We anticipate changing this in a future release. Should the program encounter directories without a default index.html or default.html file as the case may be, a warning message is issued. Consider this to be noise if you are dropping these directories in your configuration file.
We use the external Python module elementtree to parse HTML files. This module is unforgiving of malformed HTML: documents with missing closing tags are skipped, and we issue a warning suggesting the use of HTML Tidy to fix the HTML. If HTML files are not recognized by the mimetypes module, we provide alternative code which processes files by file extension. To use it, uncomment that logic so that it replaces the line if mimetypes.guess_type(rss_filename)[0] == 'text/html':.
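The two detection strategies mentioned above, MIME type lookup and file extension matching, can be sketched together. This is an illustration only: in our extension the extension-based logic replaces the mimetypes check rather than backing it up, and the helper name and the set of extensions below are assumptions.

```python
import mimetypes
import os

# Assumed list of extensions to treat as HTML; adjust for your own site.
HTML_EXTENSIONS = {".html", ".htm"}


def looks_like_html(filename):
    """Return True if the file name appears to refer to an HTML document."""
    # First ask the mimetypes module, as the generator does.
    if mimetypes.guess_type(filename)[0] == "text/html":
        return True
    # Fall back to a case-insensitive comparison of the file extension.
    extension = os.path.splitext(filename)[1].lower()
    return extension in HTML_EXTENSIONS
```

With this helper, looks_like_html("index.html") and looks_like_html("PAGE.HTM") are both true, while looks_like_html("notes.txt") is false.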
We tried using the python internal HTMLParser module introduced in version 2.2 but ran into a problem with html entity translation. Once we resolve this issue, we will replace elementtree with HTMLParser.
If a document, e.g. a txt file, isn't recognized as HTML, then an empty title is set, i.e. <title></title>. Alternatively, we could set it to the file name, i.e. <title>filename</title>.
Google supports up to 50,000 urls in a single sitemap. Once the 50,000 limit is reached, sequential sitemaps are created along with a master index.
Currently, sequential RSS files may be generated; we did not test this. If they are generated, they would need to be merged into one RSS file. A wrapper RSS index is NOT generated. In a future release we will investigate large site processing.
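For readers who want to experiment with large sites in the meantime, the splitting step itself is simple. The sketch below assumes Python 3 and shows only how a URL list might be divided into groups of at most 50,000; it does not generate an index file, and the function name is hypothetical.

```python
MAX_URLS_PER_FILE = 50000  # limit for a single Google sitemap, as noted above


def split_into_files(urls, limit=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `limit` entries."""
    return [urls[start:start + limit] for start in range(0, len(urls), limit)]
```

A site with 120,001 URLs would yield three chunks of 50,000, 50,000 and 20,001 entries, each of which could then be written to its own file.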
Support python 2.2+ HTMLParser. We tried it, but found that html entities in the title field are not maintained. If someone has a quick fix, please let us know.
Support URL list method. We will then make a http call to retrieve document title and description information. This is rather easy to do; we should have it done shortly.
Support sites with more than 50,000 URLs.
Our work has been limited to the Linux operating system, using the UTF-8 character set and python version 2.4.3. You may encounter issues if you use other operating systems, character sets and versions of python. Let us know.
Let us know about your experience with our sitemap_gen.py extensions. Feel free to provide enhancements.
Last updated: 2006-08-17
Was this resource helpful? If so, feel free to put a link to this page on your site! Just copy this code:
<a href="http://www.antezeta.com/yahoo/sitemap-generator.html"
title="Free software from Antezeta">
Free Yahoo! Site Map Feed Generator for Search Engine Optimization</a>
To better understand the nuances of Search Engine Optimization and Web Marketing, let Antezeta help you with your Search Engine Marketing Needs!
Contact us today to find out more about this topic and the rest of the Web Ecosystem!