Sunday, February 5, 2012

The dreaded Unicode Encode/Decode Errors

About a third of all the bug reports I get for PyScripter relate to running it on Python files whose paths contain non-ASCII characters.  In such cases a UnicodeEncodeError or UnicodeDecodeError may occur when you run or debug a script, or even while PyScripter is trying to provide Code Completion.  In this post I will try to describe the problem and the available solutions.
Inside PyScripter all strings, including filenames and source code, are Unicode.  However, the Python compiler infrastructure uses encoded strings internally in various places.  You might have thought, as I did, that Python 3k sorted out all this mess.  But even in Python 3k the compile function, on Windows, converts Unicode filenames into the default locale encoding (the “mbcs” encoding) and back into Unicode.  This is unnecessary and a source of many problems.  For more information see my question at StackOverflow and a related Python bug report.  In Python 2.x the compile function accepts Unicode filenames, but the code object it generates contains an encoded filename.  So, my first piece of advice is:
Don’t use filenames that cannot be encoded in the default locale encoding, i.e. filenames for which filename.encode("mbcs") fails.
This is not a PyScripter issue but a Python one, and there isn’t much one can do about it.
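The check mentioned above can be sketched as a small helper.  This is only an illustration: the names locale_encoding and is_encodable are made up for this example, and falling back to locale.getpreferredencoding() on non-Windows systems is an assumption (the “mbcs” codec exists only on Windows).  The helper also does a round trip rather than a bare encode, which is slightly stricter than the rule stated above, because some codecs may silently substitute characters instead of failing.

```python
import locale
import sys


def locale_encoding():
    """Return the codec to test against (assumption: "mbcs" on Windows)."""
    if sys.platform == "win32":
        return "mbcs"
    # Hypothetical fallback so the sketch also runs on other platforms.
    return locale.getpreferredencoding(False)


def is_encodable(filename, encoding=None):
    """Return True if filename survives an encode/decode round trip."""
    encoding = encoding or locale_encoding()
    try:
        return filename.encode(encoding).decode(encoding) == filename
    except UnicodeError:
        # Raised when a character has no representation in the encoding.
        return False


# Passing an explicit code page makes the result independent of the machine.
print(is_encodable("café.py", "cp1252"))   # True
print(is_encodable("中文.py", "cp1252"))    # False
```

Calling is_encodable(path) with no second argument tests against the default locale encoding of the machine the code runs on.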
If you work with filenames containing non-ASCII characters that can be encoded in the default locale encoding, then the situation is as follows:
  1. With Python 3k you should not have any problems.  If you do, it is a bug and should be reported.
  2. With Python 2.x you need to modify the file site.py in the Lib subdirectory of the Python installation path.  In the function setencoding, change the following statement:
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    to
    if 1:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    Due to a bug in the current PyScripter versions 2.4.3 and 2.4.6, breakpoints are not recognized even after modifying site.py.  This has now been fixed and the fix will be available with the next release.
    Note that filenames containing, for example, Chinese characters may work fine on your computer because its default locale supports Chinese characters, yet cause problems when you move the files to a different computer whose default locale does not support them.
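To get a feel for the portability problem described in the note above, you can test a filename against several code pages before moving files between machines.  The following is a minimal sketch: the function name unsupported_code_pages and the list of target code pages are chosen purely for illustration, and the sample filename is made up.

```python
def unsupported_code_pages(filename, code_pages):
    """Return the code pages from code_pages that cannot encode filename."""
    failures = []
    for code_page in code_pages:
        try:
            filename.encode(code_page)
        except UnicodeEncodeError:
            failures.append(code_page)
    return failures


# cp936: Simplified Chinese, cp1252: Western European, cp1251: Cyrillic.
targets = ["cp936", "cp1252", "cp1251"]
print(unsupported_code_pages("测试.py", targets))
```

An empty result means the name can be encoded in every code page listed; any code page that is reported is a locale under which the file is likely to trigger the errors discussed in this post.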
