Two Seemingly Identical Unicode Strings Turn Out To Be Different When Using Repr(), But How Can I Fix This?
Solution 1:
Some Unicode characters can be specified in different ways, as you've discovered: either as a single code point, or as a base code point followed by a combining code point. The character \u0300 is a COMBINING GRAVE ACCENT, which adds an accent mark to the preceding character.
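To see the difference concretely, a small sketch like the following lists the code points in each string. The helper name describe is made up for illustration; unicodedata.name comes from the standard library.

```python
import unicodedata

def describe(s):
    # Return (hex code point, official Unicode name) for each character.
    return [('U+%04X' % ord(c), unicodedata.name(c)) for c in s]

composed = u'ch\xe0o'        # accented a as one precomposed code point
decomposed = u'cha\u0300o'   # plain 'a' followed by a combining accent

for item in describe(composed):
    print(item)
for item in describe(decomposed):
    print(item)
```

The two strings render the same but have different lengths, which is why a plain == comparison fails.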
The process of converting a string to a common representation is called normalization. You can use the unicodedata module to do this:
import unicodedata

def n(s):
    return unicodedata.normalize('NFKC', s)
>>> n(u'ch\xe0o') == n(u'cha\u0300o')
True
Solution 2:
The problem is that Unicode can represent a grave-accented 'a' in two ways: as the single code point LATIN SMALL LETTER A WITH GRAVE (U+00E0), or as a plain 'a' followed by COMBINING GRAVE ACCENT (U+0300), which renders as more or less the exact same character. So there are two representations of the same character. Unicode has a term for this: Unicode equivalence.
To handle this in Python, call unicodedata.normalize on both strings before comparing them. I tried the 'NFC' form, which returns u'ch\xe0o' for both strings.
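As a rough sketch of that comparison, the helper below (the name same_text is hypothetical) normalizes both sides to the same form before testing equality. It also illustrates how 'NFC' differs from the 'NFKC' form used in Solution 1: NFC only merges canonically equivalent sequences, while NFKC additionally folds compatibility characters such as ligatures.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    # Compare two strings after normalizing both to the same form.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonically equivalent spellings of the same word.
print(same_text(u'ch\xe0o', u'cha\u0300o'))

# NFC composes the decomposed spelling into the precomposed one.
print(unicodedata.normalize('NFC', u'cha\u0300o') == u'ch\xe0o')

# U+FB01 is the 'fi' ligature: only a compatibility form maps it to 'fi'.
print(same_text(u'\ufb01', u'fi', form='NFC'))
print(same_text(u'\ufb01', u'fi', form='NFKC'))
```

If you only care about accents and other combining marks, NFC is usually the more conservative choice, since it leaves compatibility characters untouched.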