
Optimize Parsing File With Json Objects In Pandas Dataframe, Where Keys May Be Missing In Some Rows

I'm looking to optimize the code below, which takes ~5 seconds; that is too slow for a file of only 1000 lines. I have a large file where each line contains valid JSON, with each JSON object possibly missing some of the keys.
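For reference, each input line presumably looks something like this (a hypothetical example, inferred from the field names used in the answers below):

{"location": {"groupe": "A"}, "id": 1, "Mother": {"MotherName": "Jane"}, "Father": {"FatherName": "John"}}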

Solution 1:

You'll get the best performance if you build the dataframe in a single step at construction time. DataFrame.from_records takes an iterable of tuples, which you can supply from a generator (or map) that reads one record at a time. You can also parse the data faster with dict.get, which returns a default value when the key isn't found. I created an empty dict called dummy to use as the default for the intermediate gets, so that a chained get works even when the outer key is missing.
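For instance, here is the chained-get pattern in isolation (the record below is made up for illustration):

import numpy as np

record = {'id': 7, 'Mother': {'MotherName': 'Ann'}}   # no 'location' key
dummy = {}
# The outer get returns dummy when 'location' is absent, so the inner get
# still runs and falls back to np.nan instead of raising a KeyError.
groupe = record.get('location', dummy).get('groupe', np.nan)    # nan
mother = record.get('Mother', dummy).get('MotherName', np.nan)  # 'Ann'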

I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.

import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one line of JSON into a flat record tuple for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
    columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time()-start)

# The original way

start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName,
                        "FatherName": FatherName}, ignore_index=True)
print('original', time.time()-start)
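As an aside that goes beyond the original answer: recent versions of pandas can flatten nested records directly with pd.json_normalize, which also fills missing keys with NaN. A minimal sketch, assuming the same file layout:

import json
import pandas as pd

with open('file.json') as f:
    records = [json.loads(line) for line in f]

# json_normalize flattens nested dicts into dot-separated columns
# ('location.groupe', 'Mother.MotherName', ...) and pads missing keys with NaN.
df = pd.json_normalize(records)
# Selecting these columns assumes each key appears in at least one record.
df = df[['location.groupe', 'id', 'Mother.MotherName', 'Father.FatherName']]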

Solution 2:

The key point is not to append each row to the dataframe inside the loop. Keep the collected values in a list or dict container and build the dataframe from all of them at once. You can also simplify your if/else structure with a simple dict.get that returns a default value (e.g. np.nan) if the key is not found in the dictionary.

import json
import numpy as np
import pandas as pd

with open('path/to/file') as f:
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

    df = pd.DataFrame(d)
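Building plain Python lists this way is linear in the number of rows, whereas df.append copies the entire frame on every iteration, making the original loop effectively quadratic. DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0, so the build-once pattern above is the forward-compatible one as well.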
