This post is a collection of search engine optimization (SEO) tips for content management systems, especially for Plone.
1. Do not index query strings
It is often desirable to make sure that query string pages (http://yoursite/page?query_string_action=something) do not end up in search indexes. Otherwise search bots might index pages such as the site’s own search engine results (yoursite/search?SearchableText=…), lowering the visibility of actual content pages.
Googlebot supports wildcard patterns (a regex-like * and $ syntax) in robots.txt and can be configured to ignore any URL with a ? in it. See the example below.
Query string indexing causes the crawler to crawl things like:
- Various search results (?SearchableText)
- Keyword lists (?Subject)
- Language switching links (?set_language), making set_language appear as the document in the search results
Also, “almost” human-readable query strings look ugly in the address bar.
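A minimal sketch of the relevant rule (the complete robots.txt we use appears in section 8), using Googlebot’s wildcard syntax to block every URL that contains a query string:

    # Googlebot only: skip any URL containing a query string
    User-Agent: Googlebot
    Disallow: /*?*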
2. Top level domains and languages
Using a top-level domain (.fi for Finland, .uk for the United Kingdom, and so on) to distinguish between languages and regions is the optimal solution from an SEO point of view. Search engines use TLD information to reorder search results based on where the query is performed (there is a difference between google.com and google.fi results).
Plone does not use query strings for content pages. Making robots ignore query strings is especially important if you host a multilingual site and use top-level domains (TLDs) to separate languages: if you do not configure robots.txt to ignore ?set_language links, only one of your top-level domains (.com, .fi, .xxx) will get proper visibility in the search results. For example, we had a situation where our domain www.twinapex.fi did not get proper visibility because Google considered www.twinapex.com?set_language=fi to be the primary content source (accessing Finnish content through the English site and its language switching links).
3. Shared forms
Plone has some forms (send to, login) which can appear on any content page. These must be disallowed; otherwise you might get a search result whose link goes to the form page instead of the actual content page.
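For example, a sendto_form link acquired on an arbitrary content page looks like http://yoursite/some-page/sendto_form (the path here is only illustrative). The relevant rules, in the same substring-matching style as the combined robots.txt in section 8, would be:

    # Keep shared Plone forms out of the index
    User-agent: *
    Disallow: login_form
    Disallow: sendto_form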
4. Hidden content and content excluded from the navigation
Any content excluded from the sitemap navigation should also be disallowed in robots.txt. For example, if you check “Exclude from navigation” for a Plone folder, remember to update robots.txt as well.
In our case, our internal image bank must not be indexed, even though the images themselves are visible on the site. Otherwise you get funny search results: if you search for a person’s name, the photo will be the first hit instead of the biography.
5. Sitemap protocol
Crawlers use the Sitemap protocol to help determine the content pages on your site (note: the sitemap seems to be used as a hint only and is not authoritative). Since version 3.1, Plone can automatically generate sitemap.xml.gz. You still need to register sitemap.xml.gz in Google Webmaster Tools manually.
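In addition to manual registration, the Sitemap protocol lets you advertise the sitemap location directly in robots.txt, which crawlers supporting the protocol use as a discovery hint (the domain below is a placeholder for your own):

    # Point crawlers at the autogenerated Plone sitemap
    Sitemap: http://www.example.com/sitemap.xml.gz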
There exists a sitemap protocol extension for mobile sites.
6. Webmaster tools
Google Webmaster Tools lets you monitor your site’s visibility in Google and perform search-engine-specific tasks such as submitting sitemaps.
I do not know what similar functionality other search providers offer. Please share your knowledge in the blog comments.
7. HTML <head> metadata
Search engines mostly ignore <meta> tags besides the title, so there is little point in trying to fine-tune them.
8. Example robots.txt
Here is our optimized robots.txt for www.twinapex.com:
    # Normal robots.txt body is purely substring match only
    # We exclude lots of general purpose forms which are available in various mount points of the site
    # and internal image bank which is hidden in the navigation tree in any case
    User-agent: *
    Disallow: set_language
    Disallow: login_form
    Disallow: sendto_form
    Disallow: /images

    # Googlebot allows regex in its syntax
    # Block all URLs including query strings (? pattern) - contentish objects expose query string only
    # for actions or status reports which might confuse search results.
    # This will also block ?set_language
    User-Agent: Googlebot
    Disallow: /*?*
    Disallow: /*folder_factories$

    # Allow Adsense bot on entire site
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*
9. Useful resources
- How Google treats HTML meta tags. This is especially interesting considering Plone’s support for Dublin Core metadata (title, description, keywords). Note: <meta> keywords are not indexed.
- Plone SEO add-on product
- SEO with robots.txt
- Sitemaps mobile protocol