File Contain \u00c2\u00a0, Convert To Characters
Solution 1:
I have made this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:
import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)               # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')  # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")  # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')       # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
Usage:
broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)
converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)
data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])
It uses a regex to pick up the hex sequences from your string, converts them to individual bytes, and decodes them as UTF-8.
For the sample string above (I've included the three-byte character € as a test) this prints:
Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...
The \xa0 in the "Parsed data" output is caused by the way Python displays dicts in the console; it still is the actual non-breaking space.
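A quick check (a minimal sketch, not part of the original answer) confirms that the \xa0 shown in the repr is a genuine U+00A0 non-breaking space, not leftover mojibake:

```python
s = "and voila!\xa0\xa0At the moment"

print(repr(s))        # the repr escapes the character as \xa0
print(s)              # printed directly, it renders as a non-breaking space

# \xa0 and \u00a0 name the same code point, U+00A0
assert "\u00a0" in s
```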
Solution 2:
The hacky approach is to remove the outer layer of encoding:
import re
# Assume export is a bytes-like object
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)
This matches the escaped UTF-8 bytes and replaces them with the actual bytes. Writing the resulting bytes-like object to disk (without further decoding!) should result in a valid UTF-8 JSON file.
Of course this will break if the file contains genuine escaped Unicode characters in that byte range, like \u00e9 for an accented "e".
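As a sketch of the whole step (the sample payload in export is an assumption, not the asker's actual data), the substitution can be exercised like this:

```python
import re

# A mangled JSON payload as raw bytes, containing literal \u00XX escapes
export = b'{"some_key": "voila!\\u00c2\\u00a0At the moment"}'

# Replace each \u00XX escape (XX in 80-ff) with the single byte 0xXX
export = re.sub(rb'\\u00([89a-f][0-9a-f])',
                lambda m: bytes.fromhex(m.group(1).decode()),
                export, flags=re.IGNORECASE)

print(export)                 # b'{"some_key": "voila!\xc2\xa0At the moment"}'
print(export.decode('utf8'))  # the \xc2\xa0 pair decodes to a real U+00A0
```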
Solution 3:
As you are trying to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.
Let me use a full example:
import json

js = '''{"a": "and voila!\\u00c2\\u00a0At the moment you can't use our"}'''
print(js)
{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}
I would first load that with json:
x = json.loads(js)
print(x)
{'a': "and voila!Â\xa0At the moment you can't use our"}
Ok, this now looks like a UTF-8 string that was wrongly decoded as Latin-1. Let us do the reverse operation:
x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])
{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our
Ok, it is now fine and we can convert it back to a correct json string:
print(json.dumps(x))
{"a": "and voila!\u00a0At the moment you can't use our"}
meaning a correctly encoded NO-BREAK SPACE (U+00A0).
TL;DR: what you should do is:
# load the string as json:
js = json.loads(request)
# identify the string values in the json - you probably know how but I don't...
...
# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')
# convert back to a json string
request = json.dumps(js)
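For the "identify the string values" step, a hypothetical helper (fix_strings below is my own sketch, not part of the original answer) could walk the parsed structure and re-encode every string it finds:

```python
import json

def fix_strings(obj):
    """Recursively re-decode Latin-1/UTF-8 mojibake in a parsed JSON structure."""
    if isinstance(obj, str):
        try:
            return obj.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return obj  # leave strings alone that are not Latin-1 mojibake
    if isinstance(obj, list):
        return [fix_strings(v) for v in obj]
    if isinstance(obj, dict):
        return {k: fix_strings(v) for k, v in obj.items()}
    return obj  # numbers, booleans, None pass through unchanged

request = '{"a": "and voila!\\u00c2\\u00a0At the moment you can\'t use our"}'
fixed = fix_strings(json.loads(request))
print(json.dumps(fixed))  # {"a": "and voila!\u00a0At the moment you can't use our"}
```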