
How To Save Dask DataFrame To Parquet On Same Machine As Dask Scheduler/Workers?

I'm trying to save my Dask DataFrame to Parquet on the same machine where the Dask scheduler/workers are located, but I'm running into trouble doing this.

My Dask setup:

My Python script:

Solution 1:

Usually, to save a Dask DataFrame as a Parquet dataset, people do the following:

df.to_parquet(...)
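As a minimal sketch of what that usually looks like, assuming the workers all share a file system (the scheduler address, input files, and the /shared/data path here are placeholders, not from your setup):

import dask.dataframe as dd
from dask.distributed import Client

# Connect to the running scheduler (address is an assumption/placeholder).
client = Client("tcp://scheduler-host:8786")

# Build or load the Dask DataFrame; reading CSVs here is just an example.
df = dd.read_csv("/shared/data/input-*.csv")

# Each worker writes its own partitions as part files into this directory.
# This only works cleanly if /shared/data is visible to every worker.
df.to_parquet("/shared/data/output.parquet", engine="pyarrow")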

From your question it sounds like your workers may not all have access to a shared file system like NFS or S3. If that is the case and you write to local drives, then your data will be scattered across various machines with no obvious way to collect it back together. In principle, I encourage you to avoid this and invest in a shared file system. They are very helpful when doing distributed computing.

If you can't do that, then personally I would probably write in parallel to local drives and then scp the files back to one machine afterwards (see the sketch below).
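A rough sketch of that approach, with placeholder hostnames and paths; the copy step happens outside Python and is shown only as comments:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # placeholder scheduler address

# Toy dataframe standing in for whatever you actually compute.
df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

# Each worker writes the partitions it holds to its *own* local disk;
# the directory must exist (or be creatable) on every worker machine.
df.to_parquet("/local/scratch/output.parquet", engine="pyarrow")

# Afterwards, from the machine where you want the data, copy the part
# files back from each worker host (hostnames/paths are placeholders):
#   mkdir -p ./output.parquet
#   scp 'worker-1:/local/scratch/output.parquet/part.*' ./output.parquet/
#   scp 'worker-2:/local/scratch/output.parquet/part.*' ./output.parquet/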

If your dataset is small enough, then you could also call .compute() to get back a local Pandas DataFrame and then write that using Pandas:

df.compute().to_parquet(...)
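As a minimal sketch of this, assuming the whole dataset fits in memory on the machine running the client (the scheduler address and paths are placeholders):

import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # placeholder scheduler address

df = dd.read_csv("/path/on/workers/input-*.csv")  # however you build the dataframe

# Pull all partitions back to the client as a single pandas DataFrame,
# then write it with pandas -- everything ends up on this one machine.
local_df = df.compute()
local_df.to_parquet("output.parquet", engine="pyarrow")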
