
Subsetting Dask Dataframes

Is this a valid way of loading subsets of a Dask DataFrame into memory?

    i = 0
    while i < len_df:
        j = i + batch_size
        if j > len_df:
            j = len_df
        subset = df.loc[i:j]
        i = j

Solution 1:

If your dataset has well-known divisions then this might work, but instead I recommend just computing one partition at a time:

for part in df.to_delayed():
    subset = part.compute()

You can roughly control the partition size by repartitioning beforehand:

for part in df.repartition(npartitions=100).to_delayed():
    subset = part.compute()

This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
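Putting the pieces together, here is a minimal runnable sketch of the partition-at-a-time approach; the column name, row count, and partition count are illustrative:

    import pandas as pd
    import dask.dataframe as dd

    # Build a small Dask DataFrame from pandas (illustrative data).
    pdf = pd.DataFrame({"x": range(1000)})
    ddf = dd.from_pandas(pdf, npartitions=10)

    # Process one partition at a time; each `part` is a delayed
    # pandas DataFrame, so only one partition is in memory at once.
    total_rows = 0
    for part in ddf.to_delayed():
        subset = part.compute()  # materializes just this partition
        total_rows += len(subset)

    print(total_rows)  # → 1000: the partitions together cover every row

Note that the partitions here have equal sizes only because `from_pandas` split a uniform DataFrame; in general, partition row counts can vary.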
