Searching arbitrary HTML snippets on Plone content pages

Here is an example of how to crawl through Plone content to search for HTML snippets. This is done by rendering every content object and checking whether certain substrings exist in the output HTML. The snippet can be executed through-the-web in the Zope Management Interface, for example as a Script (Python).

This kind of scripting is especially useful if you need to find old links or migrate some text or HTML snippets in the content itself. Some artifacts appear only on the rendered pages (portlets, footer texts, etc.) and are therefore invisible to the normal full-text search.

# Find arbitrary HTML snippets on Plone content pages

# Collect the script output as HTML, so that you can call this
# script conveniently by just typing its URL into a web browser
buffer = ""

# We need to walk through all the content, as the
# links might not be indexed in any search catalog
for brain in context.portal_catalog():  # This fetches the cataloged brain of every content object
    try:
        obj = brain.getObject()
        # Calling the content object renders its default view and returns it as text
        # Note: this will be slow - it is equivalent to loading every page of your Plone site
        rendered = obj()
        if "yourtextmatch" in rendered:
            # Found the old link in the rendered output
            buffer += "Found old links on <a href='%s'>%s</a><br>\n" % (obj.absolute_url(), obj.Title())
    except Exception:
        pass  # Rendering may fail if the content object is broken; skip it

return buffer
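
Swallowing every exception keeps the crawl going, but it also hides which pages could not be checked. As a minimal sketch (not part of the original script), the same loop can be wrapped in a function that returns both the matches and the catalog paths of objects that failed to render. The name find_snippet and its return shape are assumptions made for illustration; getObject(), getPath(), absolute_url() and Title() are the usual catalog brain and content methods.

```python
def find_snippet(brains, needle):
    """Render each cataloged object and look for needle in the output.

    Hypothetical helper: the name and the return value are illustrative.
    Returns (matches, failures): matches is a list of (url, title)
    tuples, failures is a list of catalog paths that did not render.
    """
    matches = []
    failures = []
    for brain in brains:
        try:
            obj = brain.getObject()
            rendered = obj()  # Renders the default view, as in the script above
            if needle in rendered:
                matches.append((obj.absolute_url(), obj.Title()))
        except Exception:
            # Keep going, but remember which object could not be checked
            failures.append(brain.getPath())
    return matches, failures
```

In a through-the-web script this could be called as find_snippet(context.portal_catalog(), "yourtextmatch"), after which the failures list can be printed below the matches so that broken content does not go unnoticed.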

5 thoughts on “Searching arbitrary HTML snippets on Plone content pages”

  2. If you know the snippet will only ever be in a content object (rather than in a portlet or in your theme), you can speed this up by looking at the contents of the relevant field or by setting ajax_load=1 on the request.

  3. Looking at Plone’s main_template [1], a true value for the ‘ajax_load’ request key means that some parts of the page aren’t rendered, hence the speed:

    * no CSS or JavaScript from the head tag is loaded

    * nothing from the plone.portaltop ViewletManager, such as the personal bar, searchbox, logo and main menu

    * nothing from the plone.portalfooter ViewletManager, which contains footer and colophon information, site actions and the Analytics javascript calls if you have that configured in your site

    * neither the left nor the right column, along with all the portlets assigned there

    [1] https://github.com/plone/Products.CMFPlone/blob/master/Products/CMFPlone/skins/plone_templates/main_template.pt
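
The ajax_load suggestion from the comments can be sketched as follows. This is an illustrative example rather than a tested recipe: find_in_content_area is a hypothetical name, and the sketch presumes that the default view is rendered through a main_template that honours ajax_load. With the flag set, snippets that live in portlets, the header or the footer will no longer be found. REQUEST.set() is the standard Zope request method for storing a value on the current request.

```python
def find_in_content_area(brains, request, needle):
    """Like the full crawl, but ask main_template to skip the page chrome.

    Hypothetical helper; returns a list of (url, title) tuples.
    """
    # With ajax_load set, main_template leaves out the top and footer viewlet
    # managers and both portlet columns, so each render is cheaper
    request.set('ajax_load', 1)
    matches = []
    for brain in brains:
        try:
            obj = brain.getObject()
            if needle in obj():
                matches.append((obj.absolute_url(), obj.Title()))
        except Exception:
            pass  # Broken content objects are skipped, as in the original script
    return matches
```

From a through-the-web script the call would look like find_in_content_area(context.portal_catalog(), context.REQUEST, "yourtextmatch").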
