
File Contain \u00c2\u00a0, Convert To Characters

I have a JSON file which contains text like this: .....wax, and voila!\u00c2\u00a0At the moment you can't use our ... My simple question is: how do I CONVERT (not remove) these \u codes?

Solution 1:

I have made this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)

Usage:

broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])

It uses regex to pick up the hex sequences from your string, converts them to individual bytes and decodes them as UTF-8.
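To see the conversion on a single mangled sequence in isolation (the sample sequence here is the euro sign used above; bytes.fromhex is an equivalent alternative to the codecs call):

```python
# One mangled sequence: '\u00e2\u0082\u00ac' should become '€'
escaped = r'\u00e2\u0082\u00ac'
hexstr = escaped.replace(r'\u00', '')   # 'e282ac'
raw = bytes.fromhex(hexstr)             # b'\xe2\x82\xac'
print(raw.decode('utf8'))               # '€'
```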

For the sample string above (I've included the 3-byte character as a test) this prints:

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in the "Parsed data" output is just how Python displays dicts in the console; the value really is the actual non-breaking space.

Solution 2:

The hacky approach is to remove the outer layer of encoding:

import re
# Assume export is a bytes-like object
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)

This matches the escaped UTF-8 bytes and replaces them with actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without further decoding!) should result in a valid UTF-8 JSON file.

Of course this will break if the file contains genuine escaped unicode characters in the UTF-8 range, like \u00e9 for an accented "e".
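A minimal sketch of that substitution on a sample bytes object (the sample input here is an assumption, not your actual file):

```python
import re

# Sample file content as raw bytes, with the double-encoded escapes
export = rb'{"k": "voila!\u00c2\u00a0ok \u00e2\u0082\u00ac"}'

# Replace each escaped high byte (\u0080-\u00ff) with the real byte
export = re.sub(rb'\\u00([89a-f][0-9a-f])',
                lambda m: bytes.fromhex(m.group(1).decode()),
                export, flags=re.IGNORECASE)

print(export)  # raw UTF-8 bytes, ready to write to disk as-is
```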

Solution 3:

Since you are trying to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.

Let me use a full example:

import json

js = '''{"a": "and voila!\\u00c2\\u00a0At the moment you can't use our"}'''
print(js)

{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}

I would first load that with json:

x = json.loads(js)
print(x)

{'a': "and voila!Â\xa0At the moment you can't use our"}

OK, this now looks like a UTF-8 string that was wrongly decoded as Latin-1. Let us do the reverse operation:

x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])

{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our

OK, it is now fine, and we can convert it back to a correct JSON string:

print(json.dumps(x))

{"a": "and voila!\u00a0At the moment you can't use our"}

meaning a correctly encoded NO-BREAK SPACE (U+00A0)
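A quick sanity check (my addition, not part of the original answer) that this escape round-trips back to the actual character when the JSON is parsed:

```python
import json

s = json.loads('{"a": "voila!\\u00a0done"}')
print(repr(s['a']))        # the \u00a0 escape comes back as a real NBSP
assert s['a'][6] == '\u00a0'   # NO-BREAK SPACE
```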

TL/DR: what you should do is:

# load the string as json:
js = json.loads(request)

# identify the string values in the json - you probably know how but I don't...
...

# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')

# convert back to a json string
request = json.dumps(js)
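For the "identify the string values" step, one option is a small recursive walker (a sketch of my own, not part of the original answer) that applies the latin1/utf8 round trip to every string in a nested structure, leaving anything that doesn't fit the pattern alone:

```python
import json

def fix_strings(obj):
    """Recursively re-decode every string value as Latin-1-mangled UTF-8."""
    if isinstance(obj, str):
        try:
            return obj.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return obj  # not mangled (or not representable): keep as-is
    if isinstance(obj, list):
        return [fix_strings(v) for v in obj]
    if isinstance(obj, dict):
        return {k: fix_strings(v) for k, v in obj.items()}
    return obj  # numbers, booleans, None

js = json.loads('{"a": "voila!\\u00c2\\u00a0ok", "n": [1, "caf\\u00c3\\u00a9"]}')
fixed = fix_strings(js)
print(json.dumps(fixed))
```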
