
Parse Non-standard Semicolon Separated "json"

I have a non-standard 'JSON' file to parse. Each item is semicolon separated instead of comma separated. I can't simply replace ; with , because there might be some value containi

Solution 1:

Use the Python tokenize module to transform the text stream to one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even including semicolons. The tokenizer presents strings as whole tokens, and 'raw' semicolons are in the stream as single token.OP tokens for you to replace:

import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])

data = json.loads(''.join(corrected))

This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } allowed, although you could even track the last comma added and remove it again if the next non-newline token is a closing brace.
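The trailing-comma idea mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name `load_semicolon_json` and the `last_comma` bookkeeping are assumptions, and the input is read from a string rather than from a file.

```python
import io
import json
import tokenize

def load_semicolon_json(text):
    """Hypothetical helper: swap bare ';' for ',' and drop trailing ones."""
    corrected = []
    last_comma = None  # index in `corrected` of the most recent ',' we added
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            last_comma = len(corrected)
            corrected.append(',')
            continue
        if tok.type in (tokenize.NL, tokenize.NEWLINE):
            corrected.append(tok.string)
            continue  # newlines do not cancel a pending comma
        if last_comma is not None and tok.string in (']', '}'):
            corrected[last_comma] = ''  # the comma was trailing, remove it
        last_comma = None
        corrected.append(tok.string)
    return json.loads(''.join(corrected))
```

Called on text such as `'{"a": [1; 2;];}'`, the two semicolons directly before a closing bracket are dropped and the remaining one becomes a comma.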

Demo:

>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
...   "client" : "someone";
...   "server" : ["s1"; "s2"];
...   "timestamp" : 1000000;
...   "content" : "hello; world"
... }
... ''')
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print ''.join(corrected)
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{u'content': u'hello; world', u'timestamp': 1000000, u'client': u'someone', u'server': [u's1', u's2']}

Inter-token whitespace was dropped, but could be re-instated by paying attention to the tokenize.NL tokens and the (lineno, start) and (lineno, end) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
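If the original spacing does matter, one way to keep it is to copy the gap between consecutive tokens out of the source line, using the position tuples mentioned above. The following is a sketch under assumptions (Python 3, input held in a string, and a hypothetical function name `semicolons_to_commas`), not the original author's code.

```python
import io
import tokenize

def semicolons_to_commas(text):
    """Hypothetical helper: replace bare ';' with ',' and keep the spacing."""
    out = []
    prev_row, prev_col = 1, 0
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        if srow != prev_row:
            prev_col = 0  # first token on a new line: gap starts at column 0
        # copy whatever sat between the previous token and this one
        out.append(tok.line[prev_col:scol])
        is_semi = tok.type == tokenize.OP and tok.string == ';'
        out.append(',' if is_semi else tok.string)
        prev_row, prev_col = erow, ecol
    return ''.join(out)
```

Because `;` and `,` have the same length, the column offsets reported by the tokenizer stay valid for the rewritten text.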

Solution 2:

You can do some odd things and get it (probably) right.

Because strings in JSON cannot contain raw control characters such as \t, you can replace every ; with \t, (a tab followed by a comma). Outside strings the tab is just whitespace; inside strings it is tolerated if your JSON parser can load non-strict JSON (as Python's can).

Afterwards, convert the data back to JSON. The tabs that ended up inside strings now appear escaped as \t, so you can replace every \t, with ; and use a normal JSON parser to load the final object.

Some sample code in Python:

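import json

data = '''{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world"
}'''

# every ';' becomes tab + comma; strict=False accepts the raw tabs inside strings
dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
# re-encoding writes each in-string tab as the two-character escape \t
enc = json.dumps(dec)
# turn the escaped marker back into ';' and parse the now-valid JSON
out = json.loads(enc.replace('\\t,', ';'))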
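To see why the non-strict decoder is needed here, a small check along these lines may help. It is a minimal sketch with made-up variable names (`raw`, `strict_accepts`, `lenient`, `reencoded`); it only uses the standard `json` module.

```python
import json

raw = '{"content": "hello\t, world"}'    # a literal tab inside the string

try:
    json.loads(raw)                      # default strict=True
    strict_accepts = True
except json.JSONDecodeError:
    strict_accepts = False               # raw control characters are rejected

lenient = json.loads(raw, strict=False)  # raw control characters are allowed
reencoded = json.dumps(lenient)          # the tab is written back as the escape \t
```

One caveat worth keeping in mind: a string in the input that already contains an escaped tab followed by a comma would also be rewritten to `;` by the final replace step.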

Solution 3:

Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is to determine the current "state" (whether we are escaping a character, in a string, list, dictionary, etc), and replace ';' by ',' when in a certain state.

I don't know if this is the proper way to write it, and there is probably a way to make it shorter, but I don't have enough programming skills to produce an optimal version.

I tried to comment as much as I could :

def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }

    # these two variables represent the current state of the parser
    escaping = False
    state = list()

    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        else:
            if c == "\\":
                # character is a backslash, set the escaping flag for the next character
                escaping = True
            elif state and c == state[-1]:
                # character is expected closing token, update state
                state.pop()
            elif c in STATES:
                # character is known opening token, update state
                state.append(STATES[c])
            elif c == ';' and state == ['}']:
                # this is the delimiter we want to change
                c = ','
        yield c

    assert not state, "unexpected end of file"

def filter_text(text):
    return ''.join(filter_characters(text))

Testing with :

{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}

Returns :

{
  "client" : "someone",
  "server" : ["s1"; "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world",
  ...
}
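Note that the version above only rewrites a ';' that sits directly inside the outermost object, which is why the list in the sample output keeps its ';'. A possible variant that also handles nested lists and objects, and that ignores brackets appearing inside strings, might look like this. It is a sketch, not the original author's code: the name `semicolons_to_commas` is hypothetical and only double-quoted strings are assumed.

```python
def semicolons_to_commas(text):
    """Hypothetical variant: replace ';' with ',' inside any container."""
    closers = {'{': '}', '[': ']'}
    out = []
    stack = []          # expected closing brackets of the open containers
    in_string = False
    escaping = False
    for c in text:
        if in_string:
            if escaping:
                escaping = False      # escaped character, taken literally
            elif c == '\\':
                escaping = True
            elif c == '"':
                in_string = False     # closing quote
        elif c == '"':
            in_string = True
        elif c in closers:
            stack.append(closers[c])
        elif stack and c == stack[-1]:
            stack.pop()
        elif c == ';' and stack:
            c = ','                   # a delimiter inside an object or list
        out.append(c)
    if in_string or stack:
        raise ValueError("unexpected end of input")
    return ''.join(out)
```

Tracking strings with a separate flag, rather than on the same stack as the brackets, keeps characters such as `{` or `]` inside a string value from changing the parser state.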

Solution 4:

Pyparsing makes it easy to write a string transformer. Write an expression for the string to be changed, and add a parse action (a parse-time callback) to replace the matched text with what you want. If you need to avoid some cases (like quoted strings or comments), then include them in the scanner, but just leave them unchanged. Then, to actually transform the string, call scanner.transformString.

(It wasn't clear from your example whether you might have a ';' after the last element in one of your bracketed lists, so I added a term to suppress these, since a trailing ',' in a bracketed list is also invalid JSON.)

sample = """
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
}"""

from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json

SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString

scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))

prints:

{
  "client" : "someone",
  "server" : ["s1", "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world"}
{'content': 'hello; world', 'timestamp': 1000000, 'client': 'someone', 'server': ['s1', 's2']}
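If pyparsing is not installed, the same scan-but-leave-strings-alone idea can be approximated with the standard library `re` module. This swaps in a different technique from the one in the answer and is only a sketch: `TOKEN` and `fix` are hypothetical names, and only double-quoted strings are recognised.

```python
import json
import re

TOKEN = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")   # a double-quoted string, kept unchanged
  | (?P<trailing>;(?=\s*[}\]]))      # a semicolon just before a closing bracket
  | (?P<semi>;)                      # any other semicolon
''', re.VERBOSE)

def fix(text):
    def repl(match):
        if match.lastgroup == 'string':
            return match.group()   # leave quoted text alone
        if match.lastgroup == 'trailing':
            return ''              # drop what would become a trailing comma
        return ','                 # ordinary separator
    return TOKEN.sub(repl, text)
```

Because the string alternative is tried first, a ';' inside a quoted value is consumed as part of the string match and never reaches the replacement branches. Unlike the pyparsing version above, this sketch also drops a ';' before a closing `]`.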
