Here is a little script that uses Scrapy, a web crawling framework for Python, to search sites for certain text strings, both in page content and in PDFs. This is handy when you need to find links violating a user policy, trademarks that are not allowed, or just to see where your template output is being used. Our Scrapy example differs from a normal search engine in that it checks the HTML source code itself: you can also search for CSS classes, link targets and other elements which may be invisible to normal search engines.
Scrapy comes with a command-line tool and a project skeleton generator. You need to generate your own Scrapy project, to which you can then add your own spider classes.
Install Scrapy using Distribute (or setuptools):
easy_install Scrapy
Create project code skeleton:
scrapy startproject myscraper
Add your spider class skeleton by creating a file myscraper/spiders/spiders.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """

    name = "mycrawler"

    # Stay within these domains when crawling
    allowed_domains = ["www.mysite.com"]

    start_urls = [
        "http://www.mysite.com/",
    ]

    # Follow every link found on the crawled pages
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]
Start Scrapy to test that it crawls properly. Run the following in the project's top-level directory:
scrapy crawl mycrawler
You should see output like:
2011-03-08 15:25:52+0200 [scrapy] INFO: Scrapy 0.12.0.2538 started (bot: myscraper)
2011-03-08 15:25:52+0200 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2011-03-08 15:25:52+0200 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
You can hit CTRL+C to interrupt Scrapy.
Then let's enhance the spider a bit to search for blacklisted text, with optional whitelisting, in myscraper/spiders/spiders.py. We also use the pyPdf library to crawl inside PDF files:
""" A sample crawler for seeking a text on sites. """ import StringIO from functools import partial from scrapy.http import Request from scrapy.spider import BaseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.item import Item def find_all_substrings(string, sub): """ http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/ """ import re starts = [match.start() for match in re.finditer(re.escape(sub), string)] return starts class MySpider(CrawlSpider): """ Crawl through web sites you specify """ name = "mycrawler" # Stay within these domains when crawling allowed_domains = ["www.mysite.com", "www.mysite2.com", "intranet.mysite.com"] start_urls = [ "http://www.mysite.com/", "http://www.mysite2.com/", "http://intranet.mysite.com/" ] # Add our callback which will be called for every found link rules = [ Rule(SgmlLinkExtractor(), follow=True, callback="check_violations") ] # How many pages crawled? XXX: Was not sure if CrawlSpider is a singleton class crawl_count = 0 # How many text matches we have found violations = 0 def get_pdf_text(self, response): """ Peek inside PDF to check possible violations. @return: PDF content as searcable plain-text string """ try: from pyPdf import PdfFileReader except ImportError: print "Needed: easy_install pyPdf" raise stream = StringIO.StringIO(response.body) reader = PdfFileReader(stream) text = u"" if reader.getDocumentInfo().title: # Title is optional, may be None text += reader.getDocumentInfo().title for page in reader.pages: # XXX: Does handle unicode properly? text += page.extractText() return text def check_violations(self, response): """ Check a server response page (file) for possible violations """ # Do some user visible status reporting self.__class__.crawl_count += 1 crawl_count = self.__class__.crawl_count if crawl_count % 100 == 0: # Print some progress output print "Crawled %d pages" % crawl_count # Entries which are not allowed to appear in content. # These are case-sensitive blacklist = ["meat", "ham" ] # Enteries which are allowed to appear. 
They are usually # non-human visible data, like CSS classes, and may not be interesting business wise exceptions_after = [ "meatball", "hamming", "hamburg" ] # These are predencing string where our match is allowed exceptions_before = [ "bushmeat", "honeybaked ham" ] url = response.url # Check response content type to identify what kind of payload this link target is ct = response.headers.get("content-type", "").lower() if "pdf" in ct: # Assume a PDF file data = self.get_pdf_text(response) else: # Assume it's HTML data = response.body # Go through our search goals to identify any "bad" text on the page for tag in blacklist: substrings = find_all_substrings(data, tag) # Check entries against the exception list for "allowed" special cases for pos in substrings: ok = False for exception in exceptions_after: sample = data[pos:pos+len(exception)] if sample == exception: #print "Was whitelisted special case:" + sample ok = True break for exception in exceptions_before: sample = data[pos - len(exception) + len(tag): pos+len(tag) ] #print "For %s got sample %s" % (exception, sample) if sample == exception: #print "Was whitelisted special case:" + sample ok = True break if not ok: self.__class__.violations += 1 print "Violation number %d" % self.__class__.violations print "URL %s" % url print "Violating text:" + tag print "Position:" + str(pos) piece = data[pos-40:pos+40].encode("utf-8") print "Sample text around position:" + piece.replace("\n", " ") print "------" # We are not actually storing any data, return dummy item return Item() def _requests_to_follow(self, response): if getattr(response, "encoding", None) != None: # Server does not set encoding for binary files # Do not try to follow links in # binary data, as this will break Scrapy return CrawlSpider._requests_to_follow(self, response) else: return []
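If you want to sanity-check the blacklist/whitelist logic before unleashing a full crawl, you can run the same matching code against a hand-written string. The snippet below is a minimal sketch: it copies find_all_substrings() and the exception checks from the spider above, and the sample text is made up for illustration only.

# Quick standalone test of the matching rules used by check_violations().
# The sample text is invented for demonstration purposes.
import re

def find_all_substrings(string, sub):
    return [match.start() for match in re.finditer(re.escape(sub), string)]

blacklist = ["meat", "ham"]
exceptions_after = ["meatball", "hamming", "hamburg"]
exceptions_before = ["bushmeat", "honeybaked ham"]

data = "We serve meatballs, bushmeat is banned, but plain ham is a violation."

for tag in blacklist:
    for pos in find_all_substrings(data, tag):
        # A match is allowed if it is part of a whitelisted longer word,
        # either continuing after the match or preceding it
        ok = any(data[pos:pos + len(e)] == e for e in exceptions_after) or \
             any(data[pos - len(e) + len(tag):pos + len(tag)] == e for e in exceptions_before)
        if not ok:
            print "Violation: %r at position %d" % (tag, pos)

Running this only reports the standalone "ham", since "meatballs" and "bushmeat" are covered by the whitelists.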
Let’s turn down the logging level so we get only relevant data in the output. In myscraper/settings.py add:
LOG_LEVEL="INFO"
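If the crawl puts too much load on the target servers, or wanders too deep into large sites, you can also throttle it from the same settings.py. The values below are only a suggestion; check the Scrapy settings documentation for your version:

# Optional: be polite to the crawled sites
DOWNLOAD_DELAY = 0.5   # wait half a second between requests
DEPTH_LIMIT = 10       # do not follow links deeper than 10 levels from start_urls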
Now you can run the crawler and pipe the output to a text file:
scrapy crawl mycrawler > violations.txt
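Since every hit is reported with a "URL ..." line, a small post-processing script gives you a quick per-page summary. This is just a sketch built on the output format of check_violations() above; adjust it if you change the print statements:

# summarize_violations.py -- count reported violations per URL
# (expects the "URL ..." lines printed by check_violations above)
from collections import Counter

counts = Counter()
with open("violations.txt") as f:
    for line in f:
        if line.startswith("URL "):
            counts[line[4:].strip()] += 1

for url, count in counts.most_common():
    print "%5d  %s" % (count, url)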