How To Save Dask Dataframe To Parquet On Same Machine As Dask Scheduler/Workers?
Solution 1:
Usually, to save a Dask dataframe as a Parquet dataset, people do the following:
df.to_parquet(...)
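For a bit more context, here is a minimal end-to-end sketch of that call; the scheduler address, file paths, and engine choice are placeholders rather than anything from the original question:
import dask.dataframe as dd
from dask.distributed import Client

# Connect to a running scheduler (placeholder address)
client = Client("tcp://scheduler-address:8786")

# Build or load a Dask dataframe, then write it out as a Parquet dataset.
# Each worker writes its own partitions, so the target path needs to be
# visible to every worker (shared file system or object store).
df = dd.read_csv("/shared/data/*.csv")          # placeholder input
df.to_parquet("/shared/output/my_dataset/", engine="pyarrow")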
From your question it sounds like your workers may not all have access to a shared file system like NFS or S3. If that is the case and you store to local drives, then your data will be scattered across various machines without an obvious way to collect it together. In principle, I encourage you to avoid this and invest in a shared file system; they are very helpful when doing distributed computing.
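If an object store such as S3 is an option, to_parquet can write to it directly through fsspec; a rough sketch, where the bucket name and the storage_options values are assumptions:
import pandas as pd
import dask.dataframe as dd

# A small in-memory dataframe just to keep the sketch self-contained
df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# Writing to an s3:// path requires s3fs installed on every worker.
# Credentials may come from the environment, an instance profile,
# or be passed explicitly via storage_options.
df.to_parquet(
    "s3://my-bucket/my_dataset/",        # hypothetical bucket/prefix
    engine="pyarrow",
    storage_options={"anon": False},     # credential handling is an assumption
)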
If you can't do that, then I personally would probably write in parallel to local drives and then scp the files back to one machine afterwards.
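A rough sketch of that workaround; the scratch path and worker host names are hypothetical:
import pandas as pd
import dask.dataframe as dd

# Stand-in dataframe; in practice this would be the distributed result.
df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# Each worker writes its partitions to its own local disk.
df.to_parquet("/local/scratch/my_dataset/")   # path local to each worker

# Then pull the pieces back from each worker (host names and paths are
# placeholders). Partition file names are unique across the dataset,
# so the directories can be merged safely:
#   scp -r worker1:/local/scratch/my_dataset/* ./my_dataset/
#   scp -r worker2:/local/scratch/my_dataset/* ./my_dataset/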
If your dataset is small enough, you could also call .compute() to get back a local pandas dataframe and then write that out using pandas:
df.compute().to_parquet(...)
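Note that .compute() pulls the entire result into the memory of the machine running the client, so this option is only practical when the full dataframe comfortably fits in local RAM.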