Optimize Parsing File With JSON Objects In Pandas DataFrame, Where Keys May Be Missing In Some Rows
Solution 1:
You'll get the best performance if you build the dataframe in a single step instead of row by row. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can also parse the data faster with get, which returns a default value when a key isn't found. I created an empty dict called dummy to use as the default for the intermediate gets, so that a chained get still works when the outer key is missing.
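For example, a chained get on a record that has no Mother object at all falls back to the empty dummy dict, so the inner get returns the default instead of raising an AttributeError. A minimal sketch with a made-up record:

import numpy as np

dummy = {}
record = {'id': 7}                       # no 'Mother' key at all
name = record.get('Mother', dummy).get('MotherName', np.nan)
print(name)                              # nan, instead of an AttributeError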
I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.
import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one JSON line into a tuple of fields for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
                               columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)
# The original way: append one row at a time inside the loop.
# Note: DataFrame.append is deprecated and was removed in pandas 2.0.
start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({'groupe': groupe, 'id': id, 'MotherName': MotherName, 'FatherName': FatherName},
                       ignore_index=True)
print('original', time.time() - start)
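If you want to reproduce the timing comparison, a small script like the one below can generate a line-delimited file.json. This is only a sketch: the field names follow the answer, the values are invented, and keys are dropped at random to mimic missing data.

import json
import random

# Hypothetical test-data generator: one JSON object per line, with some
# keys randomly removed so that the default-handling paths are exercised.
with open('file.json', 'w') as f:
    for i in range(1000):
        record = {'id': i,
                  'location': {'groupe': random.choice(['A', 'B', 'C'])},
                  'Mother': {'MotherName': 'mother_%d' % i},
                  'Father': {'FatherName': 'father_%d' % i}}
        if random.random() < 0.2:        # randomly drop a nested key
            del record['Mother']['MotherName']
        if random.random() < 0.2:        # randomly drop a top-level key
            del record['id']
        f.write(json.dumps(record) + '\n')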
Solution 2:
The key point is not to append each row to the dataframe inside the loop. Keep the values in a list or dict container and build the dataframe from it once at the end. You can also simplify your if/else structure with a simple get, which returns a default value (e.g. np.nan) if the key is not found in the dictionary.
import json
import numpy as np
import pandas as pd

with open('path/to/file') as f:
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

df = pd.DataFrame(d)
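Equivalently, you can collect one dict per row in a list and let pandas build the frame in a single call at the end. This is a sketch of the same idea, assuming the file layout from the question:

import json
import numpy as np
import pandas as pd

rows = []
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        rows.append({
            'groupe': jfile['location'].get('groupe', np.nan),
            'id': jfile.get('id', np.nan),
            'MotherName': jfile['Mother'].get('MotherName', np.nan),
            'FatherName': jfile['Father'].get('FatherName', np.nan),
        })

df = pd.DataFrame(rows)    # built once, after the loop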