Generating Apache redirects from sitemap.xml for WordPress using Python

Sitemap.xml (the Sitemaps protocol) is a way to expose a website's pages to search engines. Beyond hinting crawler bots, you can find some creative uses for it.

I was migrating blog posts from one WordPress site to another. I wanted the old site to keep serving the front page and a few other static pages, and to move only the blog posts. A best practice for content migrations is to set HTTP 301 Moved Permanently redirects on the original web server, so that users and search engines following old links will automatically find the new content.

Both WordPress sites were, and still are, hosted on an Apache web server. Since writing Apache .htaccess redirects by hand is something no living soul wants to do, I wrote a little Python script using the Requests (a.k.a. urllib done right) and lxml libraries.

This script will

  • Read the sitemap.xml file
  • Extract page URLs out of it
  • (add optional filtering step here)
  • Output Apache .htaccess redirect rules so that old links will point to the new site

The beauty of scraping the information from sitemap.xml: even though this script was written for WordPress, it is valid for any CMS supporting the Sitemaps protocol (Plone, Drupal, Joomla!, you name it… they all support it).
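For reference, the script only cares about the `<loc>` elements in the sitemap, which looks roughly like this (URLs and dates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mobilizejs.com/tag/cache-control/</loc>
    <lastmod>2011-01-01</lastmod>
  </url>
  <!-- ... one <url> entry per page ... -->
</urlset>
```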

In theory one could run this script in cron so that all new WordPress content gets automatically redirected continuously.
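A crontab entry for that could look something like this (the paths and script name are hypothetical):

```shell
# Regenerate redirect rules from the sitemap once a day at 03:00
0 3 * * * python /home/user/bin/sitemap_redirects.py > /var/www/oldsite/.htaccess
```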

In addition to setting up .htaccess, WordPress provides an easy export and import tool for blog posts through its admin interface.

And the script:

# Python stdlib
from StringIO import StringIO
import urlparse

# 3rd party
import requests
from lxml import etree

# Sitemap of the source site
SITEMAP_URL = "http://mobilizejs.com/wp-content/plugins/simple-google-sitemap-xml/sitemap.xml"

# Where posts will be moved
NEW_DOMAIN = "http://opensourcehacker.com"

# Apache .htaccess line template
# http://httpd.apache.org/docs/2.0/mod/mod_alias.html#redirectmatch
REDIRECT_TEMPLATE = "RedirectMatch permanent %(path)s$ %(new_domain)s%(path)s"

response = requests.get(SITEMAP_URL)

def create_apache_redirect_to_other_domain(url, new_domain):
    """
    Creates .htaccess redirect entry.

    Create HTTP 301 Permanent Redirect.

    Output is something along the lines of::

        RedirectMatch permanent /tag/cache-control/$ http://opensourcehacker.com/tag/cache-control/

    """
    parts = urlparse.urlparse(url)

    values = {
        "new_domain": new_domain,
        "path": parts.path
    }

    # Simple Python string template substitution
    return REDIRECT_TEMPLATE % values

# Parse sitemap
tree = etree.parse(StringIO(response.content))
root = tree.getroot()

locs = root.xpath("//*[name()='loc']")

# Iterate through each sitemap <loc> entry and create the corresponding redirect
for loc in locs:
    url = loc.text
    redirect_line = create_apache_redirect_to_other_domain(url, NEW_DOMAIN)
    print redirect_line
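As a quick sanity check, the redirect helper can be exercised standalone; below is the same logic with Python 3 imports (the script above targets Python 2):

```python
# Standalone Python 3 sketch of the redirect helper above
from urllib.parse import urlparse

REDIRECT_TEMPLATE = "RedirectMatch permanent %(path)s$ %(new_domain)s%(path)s"

def create_apache_redirect_to_other_domain(url, new_domain):
    # Keep only the path component of the old URL
    parts = urlparse(url)
    return REDIRECT_TEMPLATE % {"new_domain": new_domain, "path": parts.path}

line = create_apache_redirect_to_other_domain(
    "http://mobilizejs.com/tag/cache-control/",
    "http://opensourcehacker.com")
print(line)
# RedirectMatch permanent /tag/cache-control/$ http://opensourcehacker.com/tag/cache-control/
```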

 


Javascript startsWith() and check the domain of executed script using window.location.hostname

Just to remind myself… there is no String.startsWith() in JavaScript. This bit me once again… Instead, you do a regular expression match from the beginning of the string.

To find out whether your JavaScript is being executed on a certain domain, you do:

if (window.location.hostname.match(/^opensourcehacker\.com/)) {
   // yes, this is our domain
} else {
   // a subdomain (might be www.!) or some other domain
}

Yours sincerely, The regex hater


Purging Varnish cache from Python web application

Varnish is a popular front-end caching server. If you are running it in front of your web site, there are situations where you want to force cached content out of the cache, e.g. when site pages have been updated.

Below is an example of how to create an action that purges selected pages from the Varnish cache. The example purges the whole cache, but you can use regular expressions to limit the effect of the purge.

This tutorial has been written for Ubuntu 8.04, Python 2.6, Varnish 2.x and Plone 4.x. However, it should be trivial to apply the instructions for other systems.

We use the Requests library, which is a great improvement over the Python stdlib's infamous urllib. After you try Requests, you'll never want to use urllib or curl again.

Configuring Varnish

First you need to allow an HTTP PURGE request handler in default.vcl for localhost. We'll create a special PURGE command that takes the URLs to be purged out of the cache as a regular expression in a custom header:

acl purge_hosts {
        "localhost";
        # XXX: Add your local computer public IP here if you
        # want to test the code against the production server
        # from the development instance
}

...

sub vcl_recv {

        ...

        # PURGE requests call purge() with a regular expression payload from
        if (req.request == "PURGE") {
                # Check IP for access control list
                if (!client.ip ~ purge_hosts) {
                        error 405 "Not allowed.";
                }
                # Purge for the current host using reg-ex from X-Purge-Regex header
                purge("req.http.host == " req.http.host " && req.url ~ " req.http.X-Purge-Regex);
                error 200 "Purged.";
        }
}
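With this handler in place, a purge is just an HTTP request with the custom method and header, which you can test by hand from a host allowed by the ACL; for example with curl (the host and regex are illustrative):

```shell
# Purge only blog pages from the cache of the current host
curl -v -X PURGE -H "X-Purge-Regex: ^/blog/.*" http://localhost/
```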

Creating a view

Then let’s create a Plone view which makes a request from Plone to Varnish (upstream localhost:80) and issues the PURGE command. We do this using the Requests Python library.

This is useful when finely tuned per-page purging fails and the site admins simply want to make sure that everything in the front-end cache is gone.

This view is created so that it can be hooked into the portal_actions menu, and only site admins have permission to call it.

Example view code:

import requests

from Products.CMFCore.interfaces import ISiteRoot
from five import grok

# XXX: The following monkey-patch will be unneeded soon,
# as the author updated the Requests library to handle this case

from requests.models import Request

if not "PURGE" in Request._METHODS:
    lst = list(Request._METHODS)
    lst.append("PURGE")
    Request._METHODS = tuple(lst)

class Purge(grok.CodeView):
    """
    Purge upstream cache from all entries.

    This is ideal to hook up for admins e.g. through portal_actions menu.

    You can access it as admin:: http://site.com/@@purge
    """

    grok.context(ISiteRoot)

    # Only site admins can use this
    grok.require("cmf.ManagePortal")

    def render(self):
        """
        Call the parent cache using the Requests Python library and issue a PURGE command for all URLs.

        Pipe through the response as is.
        """

        site_url = "http://www.site.com/"

        # We are cleaning everything, but we could limit the effect of purging here
        headers = {
                   # Match all pages
                   "X-Purge-Regex" : ".*"
        }

        resp = requests.request("PURGE", site_url, headers=headers)

        self.request.response.setHeader("Content-Type", "text/plain")
        text = []

        text.append("HTTP " + str(resp.status_code))

        # Dump response headers as is to the Plone user,
        # so he/she can diagnose the problem
        for key, value in resp.headers.items():
            text.append(str(key) + ": " + str(value))

        # Add payload message from the server (if any)

        text.append(str(resp.content))

        return "\n".join(text)

Now if you are a Plone site admin and visit the URL:

http://site.com/@@purge

you'll clear the front-end cache.

