Sunday, February 5, 2012

Python unicode string gotcha

Consider the following, executed in Python 2.7's interactive interpreter:

>>> s = u'Ω'
>>> se = s.encode("mbcs")
>>> print s, se
Ω Ω
>>> s == se
True
>>> print s.lower(), se.lower()
ω Ω
>>> s.lower() == se.lower()
False

Bizarre?  Not if you consider that an ANSI string (str) has no way of knowing its encoding.  Of course it could try to use the default encoding, but clearly it doesn’t.  In Python 2, str.lower() works on individual bytes and (at least in the default "C" locale) does not convert non-ASCII characters, whereas unicode.lower() does.  See also a related question on Stack Overflow.
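
The difference can be reproduced outside the interpreter session with a small sketch. It is an illustration, not PyScripter code: the "cp1253" code page is an assumption standing in for "mbcs" (which exists only on Windows), and the variable names are made up for the example. The snippet is written so that it should behave the same on Python 2.7 and on Python 3, where bytes plays the role of the Python 2 str.

```python
# -*- coding: utf-8 -*-
# Assumption: cp1253 (Greek) stands in for the Windows "mbcs" ANSI code page.
ENCODING = "cp1253"

s = u"Ω"                  # text string (unicode in Python 2)
se = s.encode(ENCODING)   # byte string (str in Python 2, bytes in Python 3)

# Lowercasing the text string uses Unicode case mapping.
text_lower = s.lower()

# Lowercasing the byte string leaves the non-ASCII byte untouched.
byte_lower = se.lower()

# Decoding first and lowercasing afterwards gives the expected result.
decoded_lower = se.decode(ENCODING).lower()

print(text_lower == u"ω")            # True
print(byte_lower == se)              # True: nothing was converted
print(decoded_lower == text_lower)   # True
```

The last line hints at one way to avoid the problem: decode byte strings with a known encoding before applying any case conversion.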

So what does all this have to do with PyScripter?  In my previous post, I mentioned that breakpoints did not work in Python 2.x when the filename contained non-ASCII characters.  For the curious, here is what was happening:

  • PyScripter passes unicode filenames to compile, which returns code objects with ANSI filenames.
  • PyScripter passes the same unicode filenames to the debugger’s Bdb.set_break.
  • The debugger stores breakpoints with filenames converted through os.path.normcase. On Windows, this function converts filenames to lowercase.  Since PyScripter passes unicode filenames, the filenames are properly converted to lowercase.
  • When the debugger checks whether we hit a breakpoint, it uses the frame’s filename, which comes from the code object’s filename and is therefore an ANSI string (str).  It converts the filename to lowercase again using os.path.normcase, but now the non-ASCII characters are not properly converted.
  • The debugger then compares the frame’s filename with the filenames of the stored breakpoints.
  • As in the code snippet above, the lowercase filenames do not match, since unicode.lower() and str.lower() behave differently on non-ASCII characters.
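
The steps above can be sketched in a few lines. This is a simplified model, not the actual bdb or PyScripter code: normcase_like_py2 is a hypothetical helper that mimics what Python 2.7's os.path.normcase does on Windows (swap slashes, then call .lower()), the breaks dictionary is a stand-in for the debugger's breakpoint table, and the filename and the "cp1253" code page are assumptions chosen for illustration. The final lookup shows one possible way to make the comparison consistent; the post does not say which fix PyScripter actually uses.

```python
# -*- coding: utf-8 -*-
ENCODING = "cp1253"  # assumed ANSI code page, standing in for "mbcs"


def normcase_like_py2(name):
    """Hypothetical helper: swap slashes, then call .lower() on the value as-is."""
    if isinstance(name, bytes):
        return name.replace(b"/", b"\\").lower()
    return name.replace(u"/", u"\\").lower()


unicode_name = u"C:/Scripts/Ωmega.py"       # what the IDE passes in
ansi_name = unicode_name.encode(ENCODING)   # what the code object carries

# Breakpoint stored under the normalized unicode filename.
breaks = {normcase_like_py2(unicode_name): [10]}

# Lookup with the frame's byte-string filename: the Ω byte is not lowercased.
frame_key = normcase_like_py2(ansi_name)
found = frame_key.decode(ENCODING) in breaks

# Possible remedy: decode the filename before normalizing it.
fixed_key = normcase_like_py2(ansi_name.decode(ENCODING))
found_fixed = fixed_key in breaks

print(found)        # False: the breakpoint is never hit
print(found_fixed)  # True
```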

As mentioned in the previous post, this has now been fixed.
