
A Better Way To Load MongoDB Data To A DataFrame Using Pandas And PyMongo?

I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame, but I get a MemoryError. My code looks like this: cursor = tweets.fin

Solution 1:

I've modified my code to the following:

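from pandas import DataFrame

# PyMongo 3+ calls this argument `projection` (it was `fields` in PyMongo 2.x)
cursor = tweets.find({}, projection=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)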

By adding the fields parameter (named projection in PyMongo 3 and later) to find(), I restricted the output, so only the selected fields are loaded into the DataFrame instead of every field. Everything works fine now.
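For what it's worth, here is a rough sketch of the same idea wrapped in a helper. The names load_fields and drop_id are made up for illustration, and collection is assumed to be a PyMongo collection such as tweets (anything with a compatible find(filter, projection) method will do). It builds a projection document and also excludes _id, which MongoDB returns by default.

```python
import pandas as pd


def load_fields(collection, fields, drop_id=True):
    """Load only the named fields of every document into a DataFrame.

    `collection` is assumed to be a PyMongo collection; `load_fields` and
    `drop_id` are illustrative names, not part of any library.
    """
    # Projection document: 1 includes a field. MongoDB returns _id unless
    # it is explicitly excluded with 0.
    projection = {name: 1 for name in fields}
    if drop_id and '_id' not in fields:
        projection['_id'] = 0
    cursor = collection.find({}, projection)
    return pd.DataFrame(list(cursor), columns=list(fields))
```

With the collection from the question, load_fields(tweets, ['id']) should give the same single-column frame as above. The fewer fields you request, the less has to be held in memory while the list is built.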

Solution 2:

The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query, as in your case, would probably be to use monary.

This post has a nice and concise explanation.

Solution 3:

An elegant way of doing it would be as follows:

import pandas as pd

def my_transform_logic(x):
    # placeholder: put your own transformation here
    if x:
        result = x
        return result

def process(cursor):
    # `db` is assumed to be an existing PyMongo database object
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # make a list of dictionaries
    records = df.to_dict('records')

    # insert the processed records
    db.collection_name.insert_many(records)

    # or upsert them one by one instead, matching on _id
    # for rec in records:
    #     db.collection_name.replace_one({'_id': rec['_id']}, rec, upsert=True)


# make a list of cursors; you can read about the parallel_scan API of PyMongo
# (parallel_scan was removed in PyMongo 4.0, so this needs an older version)
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I ran the above code with Joblib on a MongoDB collection with 2.6 million records. It didn't throw any memory errors, and the processing finished in 2 hours.
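If parallel_scan is not available in your PyMongo version, one way to keep memory bounded is to walk a single cursor in fixed-size chunks. This is only a minimal sketch, not the method used for the timing above: it is sequential rather than parallel, and iter_frames, process_in_chunks and chunk_size are hypothetical names. The cursor argument can be a PyMongo cursor or any iterable of dictionaries.

```python
from itertools import islice

import pandas as pd


def iter_frames(cursor, chunk_size=50_000):
    """Yield DataFrames holding at most chunk_size documents each."""
    cursor = iter(cursor)
    while True:
        # Pull the next chunk_size documents without materialising the rest.
        batch = list(islice(cursor, chunk_size))
        if not batch:
            break
        yield pd.DataFrame(batch)


def process_in_chunks(cursor, transform, column, chunk_size=50_000):
    """Apply transform to one column chunk by chunk; yield lists of records."""
    for df in iter_frames(cursor, chunk_size):
        df['result_col'] = df[column].apply(transform)
        yield df.to_dict('records')
```

Each yielded list can be handed to insert_many on the target collection, so only one chunk is in memory at a time.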

Solution 4:

The from_records classmethod is probably the best way to do it:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find() # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)
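As a small sketch on top of this, from_records also takes an nrows argument when it is given an iterator, which is handy for peeking at a few documents before loading everything. The helper name cursor_to_frame and the drop_id flag are my own, and cursor is assumed to be a PyMongo cursor or any iterable of dictionaries.

```python
import pandas as pd


def cursor_to_frame(cursor, drop_id=True, nrows=None):
    """Build a DataFrame from an iterable of documents via from_records.

    `cursor_to_frame` and `drop_id` are illustrative names. When nrows is
    given, only that many documents are read from the iterator.
    """
    df = pd.DataFrame.from_records(iter(cursor), nrows=nrows)
    if drop_id:
        # Drop MongoDB's _id column if it is present; ignore it otherwise.
        df = df.drop(columns=['_id'], errors='ignore')
    return df
```

For example, cursor_to_frame(client.mydb.mycollection.find(), nrows=1000) would read only the first 1000 documents, assuming the client from the snippet above.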
