Sitemap.xml (the Sitemaps protocol) exposes a website's pages to search engines. Beyond hinting crawler bots, it has some creative uses.
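For example, a minimal sitemap.xml looks something like this (the URL is illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://example.com/blog/hello-world/</loc>
        <lastmod>2011-05-01</lastmod>
    </url>
</urlset>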
I was migrating blog posts from one WordPress site to another. I wanted the old site to keep serving the front page and a few other static pages, and to move only the blog posts. A best practice for content migrations is to set HTTP 301 Moved Permanently redirects on the original web server, so that users and search engines following old links automatically find the new content.
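A single such redirect rule in .htaccess looks roughly like this (the paths and domain are made up for illustration):

# Permanently (HTTP 301) redirect one moved blog post to the new domain
Redirect permanent /2011/05/hello-world/ http://newsite.example.com/2011/05/hello-world/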
The WordPress sites were, and still are, hosted on the Apache web server. Since writing Apache .htaccess redirects by hand is something no living soul wants to do, I wrote a little Python script using the Requests (a.k.a. urllib done right) and lxml libraries.
This script will:
- Read the sitemap.xml file (a gzipped sitemap.xml.gz variant is sketched after this list)
- Extract page URLs out of it
- (add an optional filtering step here; a sketch follows the script below)
- Output Apache .htaccess redirect rules so that old links will point to the new site
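If your sitemap is served as a gzipped sitemap.xml.gz, a small decompression step can be added before parsing. A minimal sketch, assuming a hypothetical URL (the actual script below reads plain XML):

# Python stdlib
import gzip
from StringIO import StringIO

# 3rd party
import requests
from lxml import etree

# Fetch the gzipped sitemap (hypothetical URL) and decompress it in memory
response = requests.get("http://example.com/sitemap.xml.gz")
xml_data = gzip.GzipFile(fileobj=StringIO(response.content)).read()
tree = etree.parse(StringIO(xml_data))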
The beauty of scraping the information from sitemap.xml is that even though this script was written for WordPress, it works for any CMS that supports the Sitemaps protocol (Plone, Drupal, Joomla!, you name it... they all support it).
In theory, one could run this script from cron so that redirects for all new WordPress content get generated automatically.
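For example, a crontab entry along these lines could do it (the paths and schedule are made up, and the output file would then need to be pulled into the Apache configuration with an Include directive):

# Regenerate redirect rules every night at 03:00 (hypothetical paths)
0 3 * * * python /home/user/sitemap_redirects.py > /etc/apache2/redirects.conf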
In addition to setting up .htaccess, WordPress provides an easy export and import tool for blog posts through its admin interface.
And the script:
# Python stdlib
from StringIO import StringIO
import urlparse

# 3rd party
import requests
from lxml import etree

# Sitemap of the source site
SITEMAP_URL = "http://mobilizejs.com/wp-content/plugins/simple-google-sitemap-xml/sitemap.xml"

# Where posts will be moved
NEW_DOMAIN = "http://opensourcehacker.com"

# Apache .htaccess line template
# http://httpd.apache.org/docs/2.0/mod/mod_alias.html#redirectmatch
REDIRECT_TEMPLATE = "RedirectMatch permanent %(path)s$ %(new_domain)s%(path)s"

response = requests.get(SITEMAP_URL)


def create_apache_redirect_to_other_domain(url, new_domain):
    """ Creates a .htaccess redirect entry.

    Creates a HTTP 301 Permanent Redirect. Output is something along the lines::

        RedirectMatch permanent /tag/cache-control/$ http://opensourcehacker.com/tag/cache-control/
    """
    parts = urlparse.urlparse(url)

    vars = {
        "new_domain": new_domain,
        "path": parts.path,
    }

    # Simple Python string template substitution
    return REDIRECT_TEMPLATE % vars


# Parse sitemap
tree = etree.parse(StringIO(response.content))
root = tree.getroot()

locs = root.xpath("//*[name()='loc']")

# Iterate through each sitemap <loc> entry and create the corresponding redirect
for loc in locs:
    url = loc.text
    redirect_line = create_apache_redirect_to_other_domain(url, NEW_DOMAIN)
    print redirect_line
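As for the optional filtering step mentioned earlier, the final loop of the script could be replaced with something like this. A sketch only: the /20 path prefix is an assumption about date-based WordPress permalinks (/2011/05/... style), so adjust it to your own permalink settings:

def is_blog_post(url):
    """ Hypothetical filter: keep only date-based permalinks like /2011/05/post-name/ """
    path = urlparse.urlparse(url).path
    # Crude assumption: blog post paths start with a year such as /2011/
    return path.startswith("/20")

for loc in locs:
    url = loc.text
    if not is_blog_post(url):
        continue
    print create_apache_redirect_to_other_domain(url, NEW_DOMAIN)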
Thanks for this code, I’m using it now to parse sitemaps.
As a side note, you can pass a URL (HTTP or FTP), a file-like object, or a file path to etree.parse, and it will automatically go and read it, so there is no need for requests. See http://lxml.de/tutorial.html#the-parse-function
But I do agree that requests is a cool library.
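In other words, the whole fetch-and-parse step collapses to a one-liner (the URL is illustrative):

from lxml import etree

# lxml fetches the document itself when given a URL
tree = etree.parse("http://example.com/sitemap.xml")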
Hi Matthys. I didn't know about the HTTP trick for etree.parse. However, one benefit of using requests is that you are more likely to get a human-readable error message in the case of HTTP problems 🙂
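For example, requests can turn an HTTP error status into a readable exception for you (illustrative URL):

import requests

response = requests.get("http://example.com/sitemap.xml")
# Raises requests.HTTPError with a human readable message on 4xx/5xx responses
response.raise_for_status()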