How To Remove Last Utf8 Char Of A Python String
I have a string containing utf-8 encoded text. I need to remove the last utf-8 character. So far I did msg = msg[:-1] but this only removes the last byte. It works as long as th
Solution 1:
The simplest way is to decode your UTF-8 bytes to Unicode text:
without_last = msg.decode('utf8')[:-1]
You can always encode the result again with without_last.encode('utf8').
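As a minimal sketch of that round trip, assuming Python 3, where the encoded data is a `bytes` object rather than a Python 2 `str`. The helper name `drop_last_char` is hypothetical and not part of the original question:

```python
def drop_last_char(data: bytes) -> bytes:
    """Remove the last code point from UTF-8 encoded bytes (illustrative helper)."""
    # bytes -> str; raises UnicodeDecodeError if data is not valid UTF-8
    text = data.decode("utf-8")
    # Slicing a str works on code points, so [:-1] drops exactly one of them
    return text[:-1].encode("utf-8")


msg = "caf\u00e9".encode("utf-8")  # the final "é" takes two bytes in UTF-8
shortened = drop_last_char(msg)
```

Note that this removes one code point, not necessarily one user-perceived character: a base letter followed by a combining accent would only lose the accent.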
The alternative would be for you to search for a UTF-8 start byte; UTF-8 byte sequences always start with a byte with the most significant bit set to 0, or the two most significant bits set to 1, while continuation bytes always start with 10:
# find starting byte of last codepoint
pos = len(msg) - 1
while pos > -1 and ord(msg[pos]) & 0xC0 == 0x80:
    # character at pos is a continuation byte (bit 7 set, bit 6 not)
    pos -= 1
msg = msg[:pos]
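The same byte scan can be sketched for Python 3, where indexing a `bytes` object already yields an integer, so `ord()` is not needed. This is an illustration under stated assumptions, not the answer's own code: the function name `strip_last_codepoint` is hypothetical, the input is assumed to be well-formed UTF-8, and the loop stops at index 0 so that input consisting only of continuation bytes comes back empty.

```python
def strip_last_codepoint(data: bytes) -> bytes:
    """Drop the last UTF-8 encoded code point without decoding (illustrative helper)."""
    pos = len(data) - 1
    # Walk backwards while the byte at pos is a continuation byte (10xxxxxx).
    # In Python 3, data[pos] is already an int.
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    # pos now indexes the start byte of the last code point (or is -1 for
    # empty input); max() keeps the slice bound from going negative.
    return data[:max(pos, 0)]


euro_text = "ab\u20ac".encode("utf-8")  # the euro sign is three bytes: E2 82 AC
trimmed = strip_last_codepoint(euro_text)
```

Unlike the decode-and-slice approach, this never inspects more than the last few bytes, which can matter for very long byte strings.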