Installing and using Scrapy web crawler to search text on multiple sites

Here is a little script that uses Scrapy, a web crawling framework for Python, to search sites for references to certain texts, including in link content and PDF files. This is handy when you need to find links violating a user policy, trademarks which are not allowed, or simply to see where your template output is being used. Our Scrapy example differs from a normal search engine in that it checks the HTML source code: you can also search for CSS classes, link targets and other elements which may be invisible to normal search engines.

Scrapy comes with a command-line tool and a project skeleton generator. You need to generate your own Scrapy project, to which you can then add your own spider classes.

Install Scrapy using Distribute (or setuptools):

easy_install Scrapy
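Alternatively, if you have pip available, it works just as well:

pip install Scrapy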

Create the project code skeleton:

scrapy startproject myscraper
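This generates a skeleton roughly like the following (the exact contents may vary a bit between Scrapy versions):

myscraper/
    scrapy.cfg
    myscraper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py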

Add your spider class skeleton by creating a file myscraper/spiders/spiders.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """

    name = "mycrawler"

    # Stay within these domains when crawling
    allowed_domains = ["www.mysite.com"]

    start_urls = [
        "http://www.mysite.com/",
    ]

    # Follow every link found on the crawled pages
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]

Start Scrapy to check that it crawls properly. Run the following in the top-level project directory:

scrapy crawl mycrawler

You should see output like:

2011-03-08 15:25:52+0200 [scrapy] INFO: Scrapy 0.12.0.2538 started (bot: myscraper)
2011-03-08 15:25:52+0200 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2011-03-08 15:25:52+0200 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware

You can hit CTRL+C to interrupt Scrapy.

Then let’s enhance the spider a bit to search for blacklisted texts, with optional whitelisting, in myscraper/spiders/spiders.py. We also use the pyPdf library to look inside PDF files (a small standalone illustration of the whitelist logic follows the full listing):

"""

        A sample crawler for seeking a text on sites.

"""

import StringIO

from functools import partial

from scrapy.http import Request

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy.item import Item

def find_all_substrings(string, sub):
    """
    http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/
    """
    import re
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """

    name = "mycrawler"

    # Stay within these domains when crawling
    allowed_domains = ["www.mysite.com", "www.mysite2.com", "intranet.mysite.com"]

    start_urls = [
        "http://www.mysite.com/",
        "http://www.mysite2.com/",
        "http://intranet.mysite.com/"
    ]

    # Add our callback which will be called for every found link
    rules = [
        Rule(SgmlLinkExtractor(), follow=True, callback="check_violations")
    ]

    # How many pages crawled? XXX: Was not sure if CrawlSpider is a singleton class
    crawl_count = 0

    # How many text matches we have found
    violations = 0

    def get_pdf_text(self, response):
        """ Peek inside PDF to check possible violations.

        @return: PDF content as searchable plain-text string
        """

        try:
            from pyPdf import PdfFileReader
        except ImportError:
            print "Needed: easy_install pyPdf"
            raise

        stream = StringIO.StringIO(response.body)
        reader = PdfFileReader(stream)

        text = u""

        if reader.getDocumentInfo().title:
            # Title is optional, may be None
            text += reader.getDocumentInfo().title

        for page in reader.pages:
            # XXX: Does this handle unicode properly?
            text += page.extractText()

        return text

    def check_violations(self, response):
        """ Check a server response page (file) for possible violations """

        # Do some user visible status reporting
        self.__class__.crawl_count += 1

        crawl_count = self.__class__.crawl_count
        if crawl_count % 100 == 0:
            # Print some progress output
            print "Crawled %d pages" % crawl_count

        # Entries which are not allowed to appear in content.
        # These are case-sensitive
        blacklist = ["meat", "ham"]

        # Entries which are allowed to appear. They are usually
        # non-human visible data, like CSS classes, and may not be interesting business-wise
        exceptions_after = ["meatball",
                            "hamming",
                            "hamburg"
                            ]

        # These are preceding strings within which our match is allowed
        exceptions_before = [
            "bushmeat",
            "honeybaked ham"
        ]

        url = response.url

        # Check response content type to identify what kind of payload this link target is
        ct = response.headers.get("content-type", "").lower()
        if "pdf" in ct:
            # Assume a PDF file
            data = self.get_pdf_text(response)
        else:
            # Assume it's HTML
            data = response.body

        # Go through our search goals to identify any "bad" text on the page
        for tag in blacklist:

            substrings = find_all_substrings(data, tag)

            # Check entries against the exception list for "allowed" special cases
            for pos in substrings:
                ok = False
                for exception in exceptions_after:
                    sample = data[pos:pos + len(exception)]
                    if sample == exception:
                        #print "Was whitelisted special case:" + sample
                        ok = True
                        break

                for exception in exceptions_before:
                    sample = data[pos - len(exception) + len(tag):pos + len(tag)]
                    #print "For %s got sample %s" % (exception, sample)
                    if sample == exception:
                        #print "Was whitelisted special case:" + sample
                        ok = True
                        break

                if not ok:
                    self.__class__.violations += 1
                    print "Violation number %d" % self.__class__.violations
                    print "URL %s" % url
                    print "Violating text:" + tag
                    print "Position:" + str(pos)
                    piece = data[pos-40:pos+40].encode("utf-8")
                    print "Sample text around position:" + piece.replace("\n", " ")
                    print "------"

        # We are not actually storing any data, return dummy item
        return Item()

    def _requests_to_follow(self, response):

        if getattr(response, "encoding", None) is not None:
            # Server does not set encoding for binary files.
            # Do not try to follow links in binary data,
            # as this will break Scrapy
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []
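To make the whitelisting logic easier to follow, here is a small standalone sketch, independent of Scrapy and using made-up sample data, that applies the same before/after exception checks to a plain string:

import re

blacklist = ["meat", "ham"]
exceptions_after = ["meatball", "hamming", "hamburg"]
exceptions_before = ["bushmeat", "honeybaked ham"]

# Made-up sample text for illustration only
data = "We sell ham and meatballs, but no bushmeat."

for tag in blacklist:
    for match in re.finditer(re.escape(tag), data):
        pos = match.start()

        # Allowed if the match is the start of a whitelisted longer word...
        ok = any(data[pos:pos + len(e)] == e for e in exceptions_after)

        # ...or the tail of a whitelisted preceding phrase
        ok = ok or any(data[pos - len(e) + len(tag):pos + len(tag)] == e
                       for e in exceptions_before)

        if not ok:
            print "Hit %r at position %d" % (tag, pos)

Running this prints a hit only for the bare "ham"; "meatballs" and "bushmeat" are skipped by the after and before exception lists respectively.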

Let’s turn down the logging level, so we get only relevant data in the output. In myscraper/settings.py add:

LOG_LEVEL="INFO"
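If the crawl is too aggressive towards the target servers, you can also throttle it in the same file; DOWNLOAD_DELAY is a standard Scrapy setting and the half-second value below is just an example:

# Wait 0.5 seconds between requests so we do not hammer the sites
DOWNLOAD_DELAY = 0.5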

Now you can run the crawler and pipe the output to a text file:

scrapy crawl mycrawler > violations.txt
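The report can then be post-processed with standard UNIX tools; for example, the following prints each violation together with its detail lines:

grep -A 5 "Violation number" violations.txt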
