
Correct Way To Add A Column Of Random Numbers To A Dask Dataframe

What is the correct way of adding a column of random numbers to a dask dataframe? I could obviously use map_partitions to add the column to each partition but I'm not sure how the

Solution 1:

According to this discussion (https://github.com/dask/distributed/issues/2558), Dask makes no effort to set or track the NumPy seed, and the recommended approach is to use dask.array (which was mentioned in the question). The best route to reproducible randomness is therefore probably to create a dask.array with a seeded RandomState and convert it to a dask.dataframe:

import dask.array as da

# this is not reproducible
for _ in range(3):
    x = da.random.random((10, 1), chunks=(2, 2))
    print(x.sum().compute())

# this is reproducible
for _ in range(3):
    state = da.random.RandomState(1234)
    y = state.random(size=(10,1), chunks=(2,2))
    print(y.sum().compute())

# convert to a dask dataframe (ddf)
import pandas as pd  # needed for the pd.DataFrame call below
import dask.dataframe as dd
ddf = dd.from_dask_array(y, columns=['A'])

# if there's another existing dataframe ddf2
ddf2 = dd.from_pandas(pd.DataFrame(range(10), columns=['B']), npartitions=2)
ddf2

# then simple column assignment will work even if partitions are not aligned
ddf2['A'] = ddf['A']
# check that every assigned value matches the source column
print((ddf['A'].compute() == ddf2['A'].compute()).all())

# it is more efficient when the partitions align;
# inspect the task graph with ddf2.visualize() to see why
# also make sure that ddf and ddf2 have the same length:
# the assignment aligns on the index, so a length mismatch
# can silently introduce missing values downstream
# to see this, change the size of `y` above and compare ddf and ddf2
