SEO tips: query strings, multiple languages, forms and other content management system issues

This post is a collection of search engine optimization (SEO) tips for content management systems, especially Plone.

1. Do not index query strings

It is often desirable to make sure that query string pages (http://yoursite/page?query_string_action=something) do not end up in the search indexes. Otherwise search bots might index pages like the site’s own search engine results (yoursite/search?SearchableText=…), lowering the visibility of the actual content pages.

Googlebot supports wildcard patterns in robots.txt (* and $, not full regular expressions) and can be configured to ignore any URL with ? in it. See the example robots.txt in section 8.

Query string indexing causes the crawler to crawl things like

  • Various search results (?SearchableText)
  • Keyword lists (?Subject)
  • Language switching code (?set_language)… making set_language appear as the document in the search results

Also, “almost” human readable query strings look ugly in the address bar…
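
To get a feel for how much of a crawl is spent on such URLs, you could run a list of URLs (for example from your access log) through a small check. This is only a minimal sketch: the function names and the NOISY_PARAMETERS set are our own illustration, built from the parameters listed above, and are not part of Plone.

```python
from urllib.parse import urlsplit, parse_qs

# Query parameters named in the list above; extend this set for your own site.
NOISY_PARAMETERS = {"SearchableText", "Subject", "set_language"}


def has_query_string(url):
    """Return True if the URL carries any query string at all."""
    return bool(urlsplit(url).query)


def noisy_parameters(url):
    """Return the sorted parameter names in url that are in NOISY_PARAMETERS."""
    query = urlsplit(url).query
    # keep_blank_values so that "?set_language=" is still detected
    names = parse_qs(query, keep_blank_values=True).keys()
    return sorted(NOISY_PARAMETERS.intersection(names))


if __name__ == "__main__":
    for url in [
        "http://yoursite/search?SearchableText=plone",
        "http://yoursite/news?set_language=fi",
        "http://yoursite/about",
    ]:
        print(url, has_query_string(url), noisy_parameters(url))
```

Any URL for which has_query_string returns True is a candidate for exclusion in robots.txt.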

2. Top level domains and languages

Using a top level domain name (.fi for Finland, .uk for the United Kingdom, and so on) to distinguish between different languages and areas is the optimal solution from the SEO point of view. Search engines use the TLD information to reorder the search results based on where the search query is performed (there is a difference between google.com and google.fi results).

Plone doesn’t use any query strings for content pages. Making robots ignore query strings is especially important if you are hosting a multilingual site and you use the top level domain name (TLD) to separate languages: if you don’t configure robots.txt to ignore ?set_language links, only one of your top level domains (.com, .fi, .xxx) will get proper visibility in the search results. For example, we had a situation where our domain www.twinapex.fi did not get proper visibility because Google considered www.twinapex.com?set_language=fi as the primary content source (accessing Finnish content through the English site and its language switching links).
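
When auditing which domain a language switching link really belongs to, a small helper can map a ?set_language URL back to the domain that should own the content. The sketch below is an assumption on our part: the LANGUAGE_HOSTS mapping and the function name are hypothetical, with the two domains taken from the example above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical mapping from language code to the host serving that language.
LANGUAGE_HOSTS = {"en": "www.twinapex.com", "fi": "www.twinapex.fi"}


def canonical_language_url(url):
    """Rewrite a ?set_language=xx URL to the same path on the host for language xx."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    language = next((value for name, value in pairs if name == "set_language"), None)
    if language not in LANGUAGE_HOSTS:
        return url  # unknown or missing language: leave the URL untouched
    # Drop set_language but keep any other query parameters as they were.
    remaining = [(name, value) for name, value in pairs if name != "set_language"]
    return urlunsplit((
        parts.scheme,
        LANGUAGE_HOSTS[language],
        parts.path or "/",
        urlencode(remaining),
        parts.fragment,
    ))


if __name__ == "__main__":
    print(canonical_language_url("http://www.twinapex.com?set_language=fi"))
```

This does not replace the robots.txt rule. It only shows which URL we would like the search engine to treat as the primary source.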

3. Shared forms

Plone has some forms (send to, login) which can appear on any content page. These must be disallowed; otherwise you might have a search result where the link goes to the form page instead of the actual content page.

4. Hidden content and content excluded from the navigation

Any content excluded from the sitemap navigation should be disallowed in robots.txt. E.g. if you check “exclude from navigation” for a Plone folder, remember to update robots.txt as well.

In our case, our internal image bank must not end up being indexed, though the images themselves are visible on the site. Otherwise you get a funny search result: if you search by a person’s name, the photo will be the first hit instead of the biography.
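
Since the list of shared forms and hidden folders tends to change over time, it may help to generate the robots.txt rules from two plain lists instead of editing the file by hand. The following is a minimal sketch, not something Plone ships: build_robots_txt is a hypothetical helper, and the /*name form assumes the crawler honours the * wildcard.

```python
def build_robots_txt(form_names, hidden_paths, user_agent="*"):
    """Build one robots.txt group that disallows shared forms and hidden folders."""
    lines = ["User-agent: %s" % user_agent]
    for name in form_names:
        # /*name matches the form under any content page,
        # for crawlers that support the * wildcard.
        lines.append("Disallow: /*%s" % name)
    for path in hidden_paths:
        # Hidden folders are excluded by plain path prefix.
        lines.append("Disallow: /%s" % path.strip("/"))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(["login_form", "sendto_form"], ["/images"]))
```

Calling it with the form names and the image bank folder mentioned in this post yields a group with three Disallow lines.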

5. Sitemap protocol

Crawlers use the Sitemap protocol to help determine the content pages on your site (note: the sitemap seems to be used for hinting only and it is not authoritative). Since version 3.1, Plone can automatically generate sitemap.xml.gz. You still need to register sitemap.xml.gz in Google Webmaster Tools manually.

There exists a sitemap protocol extension for mobile sites.
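
Before registering the sitemap it can be useful to check what it actually lists. Here is a minimal sketch that reads a gzip-compressed sitemap and returns the page URLs. The function name and the sample document are our own, and fetching sitemap.xml.gz from your site is left out.

```python
import gzip
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(gzipped_bytes):
    """Return the <loc> values from a gzip-compressed sitemap document."""
    root = ET.fromstring(gzip.decompress(gzipped_bytes))
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


if __name__ == "__main__":
    sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://yoursite/front-page</loc></url>
  <url><loc>http://yoursite/news</loc></url>
</urlset>"""
    print(sitemap_urls(gzip.compress(sample)))
```

If the list contains URLs that robots.txt disallows, the two files contradict each other and one of them should be fixed.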

6. Webmaster tools

Google Webmaster Tools enables you to monitor your site’s visibility in Google and do some search engine specific tasks, like submitting sitemaps.

I do not know what kind of similar functionality other search providers have. Please share your knowledge in the blog comments.

7. HTML <head> metadata

Search engines mostly ignore <head> metadata besides the title, so there is no point in trying to fine-tune the <meta> tags.

8. Example robots.txt

Here is our optimized robots.txt for www.twinapex.com:

# Standard robots.txt rules are matched as path prefixes; the * wildcard is an
# extension that the major crawlers support
# We exclude lots of general purpose forms which are available at various mount points of the site
# and the internal image bank, which is hidden in the navigation tree in any case
User-agent: *
Disallow: /*set_language
Disallow: /*login_form
Disallow: /*sendto_form
Disallow: /images

# Googlebot supports the * and $ wildcards (not full regular expressions)
# Block all URLs including query strings (? pattern) - contentish objects expose query strings only for actions or status reports, which
# might confuse search results.
# This will also block ?set_language
# Googlebot obeys only its own group, so the form and image bank rules are repeated here
User-agent: Googlebot
Disallow: /*?*
Disallow: /*folder_factories$
Disallow: /*login_form
Disallow: /*sendto_form
Disallow: /images

# Allow Adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
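
To sanity check the Googlebot wildcard rules before deploying them, the matching can be approximated in a few lines of Python. This is a rough sketch under our own assumptions: it handles only Disallow rules with * and $, ignores Allow rules and rule precedence, and is not Google's actual matcher. The function names are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and $ wildcards into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece, then join the pieces with "any run of characters".
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(path, rules):
    """Return True if path (including any query string) matches a Disallow rule."""
    # An empty Disallow value blocks nothing, so empty rules are skipped.
    # re.match anchors at the start of the path, which gives prefix matching.
    return any(rule_to_regex(rule).match(path) for rule in rules if rule)


if __name__ == "__main__":
    googlebot_rules = ["/*?*", "/*folder_factories$"]
    for path in ["/search?SearchableText=plone", "/about", "/docs/folder_factories"]:
        print(path, is_disallowed(path, googlebot_rules))
```

For the final word on what Googlebot really blocks, use the robots.txt testing feature in Google Webmaster Tools.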

12 thoughts on “SEO tips: query strings, multiple languages, forms and other content management system issues”

  1. Very fine job, Mikko.

    Here are some other entries in my Plone site’s robots.txt that serve to exclude “uninteresting” internal pages from the bots and improve canonicalization:

    Disallow: /search_form
    Disallow: /sendto_form
    Disallow: /accessibility-info
    Disallow: /contact-info
    Disallow: /login_form
    Disallow: /mail_password_form?userid=
    Disallow: /news_item
    Disallow: /enabling_cookies
    Disallow: /front-page
    Disallow: /test-folder
    Disallow: /portal_javascripts
    Disallow: /portal_kss
    Disallow: /author
    Disallow: /*talkback
    Disallow: /*RSS
    Disallow: /events_listing
    Disallow: /vcs_view
    Disallow: /ics_view
    Disallow: /events/events-by-date/

    Cheers!

  4. KOMTET: That’s brilliant – must be my first text translated to russian 🙂

    Maybe you can tell me about Bing – I haven’t yet found any good articles about how to optimize for it.

  5. The Plone SEO (qSEOptimizer) add-on product allows you to set individual META fields per page. Did you have some more advanced solution in mind?

  6. Yes, we use Plone SEO (qSEOptimizer) for these tasks, but we have some trouble with it… It’s not stable. But I don’t know of a more powerful solution.

  7. Plone SEO is a very good product, though it might need a helping hand from the community.

    I suggest you file feature enhancement requests and bug reports against the Plone SEO add-on product. If you are able to file precise reports, the Quintagroup people and others might help you with the problems.

    Also, if you have some very site specific task which is not a general problem, you can override the tags in main_template.pt yourself.

  8. If you’re adding ‘login_form’ you probably also want to add ‘login’ to your exclusions, since Plone sends a 302 Moved Temporarily from ‘login’ to ‘login_form’. (Google indexes 302s, but not 301 Moved Permanently.)
