Sample Rows Of Pandas Dataframe In Proportion To Counts In A Column

March 31, 2024

I have a large pandas dataframe with about 10,000,000 rows. Each one represents a feature vector. The feature vectors come in natural groups and the group label is in a column ca

Solution 1:

You can combine groupby with sample to draw the same fraction of rows from every group:

sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
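As a rough check of the one-liner above, here is a minimal sketch on a made-up frame. The group sizes (600/300/100) and the vals column are assumptions for illustration, not taken from the question. It uses DataFrameGroupBy.sample (pandas 1.1 and later), which draws the same fraction from each group without going through apply.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three groups of unequal size.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (600, 300, 100)),
    "vals": rng.standard_normal(1000),
})

# Draw 10% of the rows of every group; the original index is kept.
sample_df = df.groupby("group_id").sample(frac=0.1, random_state=0)

# Rows per group in the sample.
counts = sample_df["group_id"].value_counts().sort_index().to_dict()
print(counts)
```

Because every group is sampled at the same rate, the sample keeps the group proportions of the full frame.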

Solution 2:

The following samples a total of N rows, with each group appearing in its original proportion (rounded to the nearest integer), then shuffles the result and resets the index. Using this example frame:

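import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))
N = 10  # desired total number of sampled rows (example value, not from the original answer)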

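Short and sweet (note that weights='A' uses the values in column A as per-row sampling weights, so rows with larger A values are drawn more often regardless of how big their group is):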

df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
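To see what weights='A' does on this particular frame, here is a small sketch. It compares each group's share of the rows with its share of the total sampling weight, which is its probability on the first draw. The names group_share, weight_share and comparison are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20),
))

# Share of rows held by each group: what proportional sampling should match.
group_share = df["A"].value_counts(normalize=True).sort_index()

# Share of the total sampling weight each group gets under weights='A',
# since every row is weighted by its own value in column A.
weight_share = df.groupby("A")["A"].sum() / df["A"].sum()

comparison = pd.DataFrame({"group_share": group_share, "weight_share": weight_share})
print(comparison)
```

On this data the two columns differ. Group 4 holds 25% of the rows but about 44% of the weight, so the weighted draw does not follow the group sizes. A plain df.sample(n=N) gives every row the same chance, which matches the group proportions in expectation only.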

Long version

(df.groupby('A', group_keys=False)
   .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
   .sample(frac=1)
   .reset_index(drop=True))
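The long version can be unpacked into its steps. This is only a sketch: N = 10 is an assumed example value, and alloc and parts are names introduced here. It first computes the rounded per-group allocation, then samples each group, then shuffles.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20),
))
N = 10  # assumed total sample size for this illustration

# Rows to take from each group: N times the group's share, rounded.
sizes = df["A"].value_counts().sort_index()
alloc = np.rint(N * sizes / len(df)).astype(int)

# Sample each group at its allocated size, then shuffle and reset the index.
parts = [
    group.sample(n=int(alloc.loc[key]), random_state=0)
    for key, group in df.groupby("A")
]
sample = pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)

print(alloc.to_dict())
print(len(sample))
```

Because each group is rounded independently, the allocations are not guaranteed to add up to exactly N for every choice of N. Checking alloc.sum() before sampling is a cheap safeguard.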

Solution 3:

I was looking for a similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou is trying to do also makes sense when we want to extract samples from each group based on their proportions of the full data.

# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))

# modified : sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))
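One thing worth checking with the modified version: passing a group's own share as frac samples that group at a rate equal to its share, so a group holding share p of the data contributes roughly p * p * n rows. A small sketch on made-up data (the 600/300/100 split is an assumption) shows the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical data: shares of 0.6, 0.3 and 0.1.
df = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (600, 300, 100)),
    "vals": np.arange(1000),
})
n = df.shape[0]

# Same logic as the modified snippet, written as an explicit loop.
parts = []
for _, group in df.groupby("group_id"):
    parts.append(group.sample(frac=len(group) / n, random_state=0))
sample_df = pd.concat(parts)

counts = sample_df["group_id"].value_counts().sort_index().to_dict()
print(counts)
```

Large groups end up over-represented relative to their share, and the total sample size depends on how the data is split. Whether that is what you want depends on the goal; it is not the same as keeping the original proportions.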

Solution 4:

This is not as simple as just grouping and calling .sample; you need to work out the fractions first. Since you want to grab 10% of the total number of rows, split across the groups in different proportions, you need to calculate how many rows each group contributes from the main dataframe. For instance, using the split you mentioned in the question, group A ends up with 1/20 of the total number of rows, group B with 1/30 and group C with 1/60. You can put these fractions in a dictionary and then use .groupby and pd.concat to concatenate the sampled rows from each group into one dataframe. Use the n parameter of the .sample method instead of the frac parameter.

fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))
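Here is a hedged sketch of the same idea that starts from the target shares (1/2, 1/3, 1/6) and the overall 10% rather than from precomputed fractions. The names shares, total and n_per_group are mine. It uses round() instead of int(), because int() truncates and a product such as 1/60 * N can land a hair below the intended whole number in floating point.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with unequal group sizes: 40 for A, 60 for B, 20 for C.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (40, 60, 20)),
    "vals": rng.standard_normal(120),
})

# Desired composition of the sample, independent of the group sizes.
shares = {"A": 1 / 2, "B": 1 / 3, "C": 1 / 6}

# Overall sample size (10% of all rows), then rows per group.
total = round(0.10 * len(df))
n_per_group = {key: round(share * total) for key, share in shares.items()}

# Sample each group at its fixed size and stack the pieces.
sample_df = pd.concat([
    group.sample(n=n_per_group[key], random_state=0)
    for key, group in df.groupby("group_id")
])

counts = sample_df["group_id"].value_counts().sort_index().to_dict()
print(n_per_group)
print(counts)
```

Note that .sample(n=...) raises an error if a group has fewer rows than requested, so very small groups need either replace=True or a smaller target.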

Edit:

This is to highlight the importance of fulfilling the requirement that group_id A should have half of the sampled rows, group_id B two sixths of the sampled rows and group_id C one sixth of the sampled rows, regardless of how the original data is split between the groups.

Starting with equal portions: each group starts with 40 rows

df1 = pd.DataFrame({'group_id': ['A','B', 'C']*40,
                   'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))

#     group_id      vals
# 12         A -0.175109
# 51         A -1.936231
# 81         A  2.057427
# 111        A  0.851301
# 114        A  0.669910
# 60         A  1.226954
# 73         B -0.166516
# 82         B  0.662789
# 94         B -0.863640
# 31         B  0.188097
# 101        C  1.802802
# 53         C  0.696984

print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        24         A  0.161328
#          21         A -1.399320
#          30         A -0.115725
#          114        A  0.669910
# B        34         B -0.348558
#          7          B -0.855432
#          106        B -1.163899
#          79         B  0.532049
# C        65         C -2.836438
#          95         C  1.701192
#          80         C -0.421549
#          74         C -1.089400

First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).

Second solution: 4 rows for each group (each one third of the sampled rows)

Working with differently sized groups: 40 for A, 60 for B and 20 for C

df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                   'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))

#     group_id      vals
# 29         A  0.306738
# 35         A  1.785479
# 21         A -0.119405
# 4          A  2.579824
# 5          A  1.138887
# 11         A  0.566093
# 80         B  1.207676
# 41         B -0.577513
# 44         B  0.286967
# 77         B  0.402427
# 103        C -1.760442
# 114        C  0.717776

print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  2.579824
#          32         A  0.451882
#          5          A  1.138887
#          17         A -0.614331
# B        47         B -0.308123
#          52         B -1.504321
#          42         B -0.547335
#          84         B -1.398953
#          61         B  1.679014
#          66         B  0.546688
# C        105        C  0.988320
#          107        C  0.698790

First solution: consistent.

Second solution: now group B has taken 6 of the sampled rows when it is supposed to take only 4.

Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C

df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
                   'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))

#     group_id      vals
# 48         A  1.214525
# 19         A -0.237562
# 0          A  3.385037
# 11         A  1.948405
# 8          A  0.696629
# 39         A -0.422851
# 62         B  1.669020
# 94         B  0.037814
# 67         B  0.627173
# 93         B  0.696366
# 104        C  0.616140
# 113        C  0.577033

print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  0.284448
#          11         A  1.948405
#          8          A  0.696629
#          0          A  3.385037
#          31         A  0.579405
#          24         A -0.309709
# B        70         B -0.480442
#          69         B -0.317613
#          96         B -0.930522
#          80         B -1.184937
# C        101        C  0.420421
#          106        C  0.058900

This is the only time the second solution offered some consistency (out of sheer luck, I might add).
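The row counts in the three experiments can be reproduced without any random sampling. The sketch below (helper names fixed_share_counts and per_group_frac_counts are mine, and round() is an assumption about how the sizes are rounded) computes how many rows each approach takes per group for the three layouts used above.

```python
# Target composition of the sample: half A, a third B, a sixth C.
shares = {"A": 1 / 2, "B": 1 / 3, "C": 1 / 6}

# Group sizes of the three example frames (df1, df2, df3).
layouts = {
    "equal": {"A": 40, "B": 40, "C": 40},
    "B largest": {"A": 40, "B": 60, "C": 20},
    "A largest": {"A": 60, "B": 40, "C": 20},
}


def fixed_share_counts(sizes, shares, overall_frac=0.10):
    """Rows per group when the sample composition is fixed by `shares`."""
    total = round(overall_frac * sum(sizes.values()))
    return {g: round(shares[g] * total) for g in sizes}


def per_group_frac_counts(sizes, frac=0.10):
    """Rows per group when every group is sampled at the same rate."""
    return {g: round(frac * size) for g, size in sizes.items()}


results = {
    name: (fixed_share_counts(sizes, shares), per_group_frac_counts(sizes))
    for name, sizes in layouts.items()
}
for name, (fixed, per_group) in results.items():
    print(name, fixed, per_group)
```

The fixed-share counts stay at 6/4/2 in every layout, while the per-group 10% counts follow the group sizes and only coincide with 6/4/2 when the data already happens to be split 60/40/20.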

I hope this proves useful.
