Subsetting Dask Dataframes
Is this a valid way of loading subsets of a Dask DataFrame into memory?

i = 0
while i < len_df:
    j = i + batch_size
    if j > len_df:
        j = len_df
    subset = df.loc[i:j]
Solution 1:
If your dataset has well-known divisions then this might work, but instead I recommend just computing one partition at a time.
for part in df.to_delayed():
    subset = part.compute()
You can roughly control the size of each subset by repartitioning beforehand:
for part in df.repartition(npartitions=100).to_delayed():
    subset = part.compute()
This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
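For comparison, the fixed-row batching the question is aiming at is straightforward on an in-memory pandas frame, because positional `.iloc` slicing guarantees batch sizes regardless of the index (a sketch with toy data; on a Dask frame, `.loc` is label-based and relies on known divisions, which is why per-partition iteration is preferred above):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})
batch_size = 4

batches = []
for start in range(0, len(df), batch_size):
    # Positional slicing: every batch has batch_size rows except possibly the last.
    subset = df.iloc[start:start + batch_size]
    batches.append(len(subset))

print(batches)  # [4, 4, 2]
```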