Fix Linux filename encodings with Python

We recently deployed a web-based system to an old Red Hat installation that still used Latin-1 as the primary encoding for its filesystem. This caused some head-scratching, because we had developed on Ubuntu, which uses UTF-8 for filenames. Files were copied to the production server using rsync. When we ran ls in a terminal, everything seemed to be fine, because the terminal itself was configured for UTF-8 and thus decoded the filenames correctly when printing them. However, the web server was unable to serve images with the Finnish characters ä and ö in their filenames.
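To make the mismatch concrete, here is a small Python 3 sketch showing that the same visible name is a different byte sequence in each encoding. The sample filename is made up for illustration; any name containing ä or ö behaves the same way.

```python
# Hypothetical sample filename, used only for illustration
name = "sää.jpg"

utf8_bytes = name.encode("utf-8")      # bytes a UTF-8 system stores on disk
latin1_bytes = name.encode("latin-1")  # bytes a Latin-1 system stores on disk

print(utf8_bytes)    # b's\xc3\xa4\xc3\xa4.jpg'
print(latin1_bytes)  # b's\xe4\xe4.jpg'

# Reading the UTF-8 bytes as if they were Latin-1 gives a different,
# garbled name: misread is now "sÃ¤Ã¤.jpg"
misread = utf8_bytes.decode("latin-1")
```

The filesystem stores only the bytes, so a program that looks for the Latin-1 form of a name will not find a file that was saved under the UTF-8 form.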

Once the head-scratching was over, we had figured out that renaming the files so that their names were encoded in Latin-1 made the web server serve them fine. If it were a fresh start, I’d rather configure the server itself to use the now-prevalent UTF-8, but because the infrastructure was shared with other, older projects, this was out of the question.

Thus, the script below was created to address the filename encoding problem (syntax highlighting is available in the original publication):

#!/usr/bin/env python3
"""

    Recursively fix filename encoding problems

    http://www.opensourcehacker.com, MIT licensed

"""

import os

# Source filename encoding
FROM = "utf-8"

# Target filename encoding
TO = "latin-1"

# Current working directory, as bytes so that os.walk() yields raw byte names
PATH = os.getcwdb()

for root, dirs, files in os.walk(PATH):

    # Filenames are raw byte strings; the filesystem does not record
    # which encoding they are in, so we assume FROM
    for f in files:
        try:
            converted = f.decode(FROM).encode(TO)
        except UnicodeError:
            # Either not valid FROM, or not representable in TO
            print("Cannot convert: %r" % f)
            continue

        fullpath = os.path.join(root, f)

        newpath = os.path.join(root, converted)

        if newpath != fullpath:
            print("Renaming: %r to: %r" % (fullpath, newpath))
            os.rename(fullpath, newpath)
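
The script above renames only files; directory names with non-ASCII characters are left as they are. Below is a minimal Python 3 sketch of one way to cover directories too, assuming a bottom-up walk so that the contents of a directory are handled before the directory itself is renamed. The function name `convert_names` and its `dry_run` flag are introduced here for illustration and are not part of the original script.

```python
import os


def convert_names(top, src="utf-8", dst="latin-1", dry_run=True):
    """Rename files and directories under `top` from `src` to `dst` encoding.

    `top` must be a bytes path so that os.walk() yields raw byte names.
    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed; the list only shows what would happen.
    """
    renames = []
    # topdown=False visits a directory's contents before the directory
    # itself appears in its parent's `dirs` list
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files + dirs:
            try:
                converted = name.decode(src).encode(dst)
            except UnicodeError:
                # Not valid `src`, or not representable in `dst`: skip it
                continue
            if converted != name:
                old = os.path.join(root, name)
                new = os.path.join(root, converted)
                renames.append((old, new))
                if not dry_run:
                    os.rename(old, new)
    return renames
```

Calling `convert_names(os.getcwdb())` first and reading the returned list is a cheap way to check the plan before running it again with `dry_run=False`.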

I still don’t know how to verify, check, or guess which encoding a Linux filesystem is using (if not UTF-8…), so someone please enlighten me. In this case we learned it through an educated guess and testing: we created two files, one with its name encoded in UTF-8 and one in Latin-1, and saw which one the web server served. I am also unsure whether the fact that the files were originally kept in a Subversion repository and committed there from Windows had anything to do with the problem.
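
The two-file test described above could be sketched roughly as follows in Python 3. The helper names `create_probe_files` and `classify_name` and the probe filename are assumptions made for this example. The second helper is only a heuristic: a non-ASCII name that decodes cleanly as UTF-8 is very likely UTF-8, while anything else is in some other encoding that cannot be identified from the bytes alone.

```python
import os


def create_probe_files(directory, visible_name="probe-ä.txt"):
    """Create the same visible name twice, once per candidate encoding.

    The file content records which encoding the *filename* uses, so
    requesting the name through the web server shows which one it finds.
    Returns a dict mapping encoding name to the byte path created.
    """
    paths = {}
    for encoding in ("utf-8", "latin-1"):
        raw_name = visible_name.encode(encoding)
        path = os.path.join(os.fsencode(directory), raw_name)
        with open(path, "w") as handle:
            handle.write(encoding + "\n")
        paths[encoding] = path
    return paths


def classify_name(raw_name):
    """Heuristically classify a byte filename as ascii, utf-8 or other."""
    try:
        raw_name.decode("ascii")
        return "ascii"  # identical bytes in UTF-8 and Latin-1
    except UnicodeDecodeError:
        pass
    try:
        raw_name.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"  # possibly Latin-1, but this cannot be proven
```

Running `classify_name` over the names returned by `os.listdir(b".")` gives a rough picture of what an existing directory tree contains before any renaming is attempted.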

6 thoughts on “Fix Linux filename encodings with Python”

  1. rsync has options to convert the encoding while copying; this might have been easier to use.

  2. > but because the infrastructure was shared with other, older, projects this was out of the question.

    I’m not a pro sysadmin, but for me, encoding is supposed to be a user’s preference and not a system-wide one.

    But from what I see on kubuntu 11.04, non-ascii filenames are undesirable for Plone 4.1.

    2011-09-16T19:26:54 ERROR Zope.SiteErrorLog 1316194014.810.788851390806 http://site_url/++resource++static/www/test_%C3%B6.gif

    Module posixpath, line 70, in join
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

    Maybe I get the error because of a symlink or something like that.
    A modern OS isn’t a guarantee, but it’s interesting to see that it works on your Ubuntu installation.
