Generating Apache redirects from sitemap.xml for WordPress using Python

Sitemap.xml (Sitemaps) is a protocol that exposes a website's pages to search engines. Beyond giving hints to crawler bots, you can find some creative uses for it.

I was migrating blog posts from one WordPress site to another. I wanted the old site to keep serving the front page and a few other static pages, and to move only the blog posts. A best practice for content migrations is to set up HTTP 301 Moved Permanently redirects on the original web server so that users and search engines following old links automatically find the new content.

Both WordPress sites were, and still are, hosted on the Apache web server. Since writing Apache .htaccess redirects by hand is something no living soul wants to do, I wrote a little Python script using the Requests (a.k.a. urllib done right) and lxml libraries.

This script will

  • Read the sitemap.xml file
  • Extract page URLs out of it
  • (add optional filtering step here)
  • Output Apache .htaccess redirect rules so that old links will point to the new site
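
The read, extract and filter steps could be sketched roughly as follows. This is a minimal sketch under a few assumptions, not the script from this post: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra installs, the sample sitemap and the example.com URLs are made up, and is_blog_post is a hypothetical filter that assumes dated permalinks like /2012/03/some-post/. It also gunzips the data when it sees the gzip magic bytes, which covers the sitemap.xml.gz case.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the Sitemaps protocol (sitemaps.org)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_bytes):
    """Return all <loc> URLs from sitemap bytes, gunzipping them if needed."""
    # Gzip streams always start with the magic bytes 0x1f 0x8b
    if sitemap_bytes[:2] == b"\x1f\x8b":
        sitemap_bytes = gzip.decompress(sitemap_bytes)
    root = ET.fromstring(sitemap_bytes)
    locs = root.findall(".//sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def is_blog_post(url):
    """Hypothetical filter: keep dated permalinks such as /2012/03/some-post/."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) >= 2 and segments[0].isdigit() and len(segments[0]) == 4


# Tiny made-up sitemap, gzipped in memory so the example needs no network
SAMPLE = gzip.compress(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about/</loc></url>
  <url><loc>http://example.com/2012/03/some-post/</loc></url>
</urlset>""")

# Only the dated permalink survives; the front page and /about/ stay put
post_urls = [u for u in extract_urls(SAMPLE) if is_blog_post(u)]
print(post_urls)
```

Adjust the filter to whatever your own permalink structure looks like; the point is only that the front page and static pages never reach the redirect step.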

The beauty of scraping the information from sitemap.xml: even though this script was written for WordPress, it works for any CMS supporting the Sitemaps protocol (hint: Plone, Drupal, Joomla!, you name it; they all support it).
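
One thing to watch for with other CMSs and sitemap plugins: the Sitemaps protocol also allows a sitemap index file, a <sitemapindex> document whose <loc> entries point to further sitemaps rather than to pages. Below is a rough sketch of following those. It again uses the standard library parser rather than lxml, and the fetch argument, the FAKE_SITE dictionary and the example.com URLs are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace}tag"
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(sitemap_url, fetch):
    """Return page URLs, following <sitemapindex> entries recursively.

    fetch is any callable that maps a URL to the raw sitemap bytes.
    """
    root = ET.fromstring(fetch(sitemap_url))
    locs = [loc.text.strip() for loc in root.iter(NS + "loc") if loc.text]
    if root.tag == NS + "sitemapindex":
        # Every <loc> in an index is another sitemap, not a page
        urls = []
        for child_url in locs:
            urls.extend(collect_page_urls(child_url, fetch))
        return urls
    return locs


# In-memory stand-in for a web server, so the sketch runs offline
FAKE_SITE = {
    "http://example.com/sitemap.xml": (
        b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<sitemap><loc>http://example.com/posts.xml</loc></sitemap>"
        b"</sitemapindex>"
    ),
    "http://example.com/posts.xml": (
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<url><loc>http://example.com/hello-world/</loc></url>"
        b"</urlset>"
    ),
}

pages = collect_page_urls("http://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(pages)
```

Against a real site, fetch could be something like `lambda url: requests.get(url).content`.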

In theory, one could run this script from cron so that all new WordPress content gets redirected automatically.
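
If you go down the cron route, the script has to be safe to run over and over without piling up duplicate rules. One way to do that, sketched here as an assumption rather than something I have in production, is to keep the generated rules between two marker comments and replace that block on every run, much like WordPress marks its own section of .htaccess. The marker texts, the function name and the new.example URLs are all made up.

```python
BEGIN_MARK = "# BEGIN sitemap redirects"
END_MARK = "# END sitemap redirects"


def update_managed_block(htaccess_text, redirect_lines):
    """Return htaccess_text with the marked block replaced by redirect_lines."""
    # Sorting and de-duplicating keeps the output stable between runs
    block = "\n".join([BEGIN_MARK, *sorted(set(redirect_lines)), END_MARK])
    start = htaccess_text.find(BEGIN_MARK)
    end = htaccess_text.find(END_MARK)
    if start != -1 and end > start:
        # Replace the previously generated block, keep everything around it
        return htaccess_text[:start] + block + htaccess_text[end + len(END_MARK):]
    # First run: append the block after the existing hand-written rules
    separator = "" if htaccess_text.endswith("\n") or not htaccess_text else "\n"
    return htaccess_text + separator + block + "\n"


existing = "RewriteEngine On\n"
rules = [
    "RedirectMatch permanent ^/b/$ http://new.example/b/",
    "RedirectMatch permanent ^/a/$ http://new.example/a/",
]
updated = update_managed_block(existing, rules)
print(updated)
```

Running the function again on its own output gives the same text back, which is what you want from a cron job. Reading and writing the actual .htaccess file is left out on purpose.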

In addition to setting up .htaccess, note that WordPress provides an easy export and import tool for blog posts through its admin interface.

And the script:

# Python stdlib
from io import BytesIO
from urllib.parse import urlparse

# 3rd party
import requests
from lxml import etree

# Sitemap of the source site
SITEMAP_URL = "http://mobilizejs.com/wp-content/plugins/simple-google-sitemap-xml/sitemap.xml"

# Where posts will be moved
NEW_DOMAIN = "http://opensourcehacker.com"

# Apache .htaccess line template
# http://httpd.apache.org/docs/2.0/mod/mod_alias.html#redirectmatch
# RedirectMatch takes a regular expression: ^ and $ anchor it to the whole path,
# so that the front page entry "/" does not match every URL ending in a slash
REDIRECT_TEMPLATE = "RedirectMatch permanent ^%(path)s$ %(new_domain)s%(path)s"


def create_apache_redirect_to_other_domain(url, new_domain):
    """
    Creates .htaccess redirect entry.

    Create HTTP 301 Permanent Redirect.

    Output is something along lines::

        RedirectMatch permanent ^/tag/cache-control/$ http://opensourcehacker.com/tag/cache-control/

    """
    parts = urlparse(url)

    template_vars = {
        "new_domain": new_domain,
        "path": parts.path,
    }

    # Simple Python string template substitution
    return REDIRECT_TEMPLATE % template_vars


# Download the sitemap; fail loudly on HTTP errors instead of parsing an error page
response = requests.get(SITEMAP_URL)
response.raise_for_status()

# Parse sitemap
tree = etree.parse(BytesIO(response.content))
root = tree.getroot()

# <loc> elements are namespaced, so match on the element name instead of a plain //loc
locs = root.xpath("//*[name()='loc']")

# Iterate through each sitemap <loc> entry and create the corresponding redirect
for loc in locs:
    url = loc.text.strip()
    redirect_line = create_apache_redirect_to_other_domain(url, NEW_DOMAIN)
    print(redirect_line)
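
A caveat about the generated rules: RedirectMatch treats its first argument as a regular expression, so a dot in a path like /2012/03/post.html matches any character, and other metacharacters can do worse. A minimal sketch of a stricter variant follows, assuming you want exact path matches only. The function name create_escaped_redirect and the old.example and new.example domains are made up for illustration.

```python
import re
from urllib.parse import urlparse


def create_escaped_redirect(url, new_domain):
    """Build a RedirectMatch line whose pattern matches the path literally."""
    path = urlparse(url).path
    # re.escape neutralises regex metacharacters such as "." and "+";
    # ^ and $ make sure the whole URL path has to match
    pattern = "^" + re.escape(path) + "$"
    # The redirect target is a plain URL, so the path goes in unescaped
    return "RedirectMatch permanent %s %s%s" % (pattern, new_domain, path)


line = create_escaped_redirect(
    "http://old.example/2012/03/post.html", "http://new.example"
)
print(line)
```

The escaping is done with Python's regex rules, which agree with Apache's regex engine for ordinary URL paths, but do check the output if your URLs contain unusual characters.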


Thoughts on “Generating Apache redirects from sitemap.xml for WordPress using Python”

  1. Thanks for this code, I’m using it now to parse sitemaps.

    As a side note, you can pass a URL (http or ftp), a file-like object or a file path to etree.parse, and it’ll automatically go read that, i.e. there is no need for requests. See http://lxml.de/tutorial.html#the-parse-function

    But I do agree that requests is a cool library.

  2. Hi Matthys. I didn’t know about the http trick for etree.parse. However, there is one benefit to using requests: you are more likely to get a human-readable error message in case of HTTP problems 🙂
