
Specific Number Of Test/train Size For Each Class In Sklearn

Data:

import pandas as pd
data = pd.DataFrame({'classes':[1,1,1,2,2,2,2],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})

My code:

import numpy as np
from sklearn.cross_validati

Solution 1:

There is no sklearn function or parameter that does this directly. The stratify parameter samples in proportion to class frequencies, which, as you indicated in your comment, is not what you want.
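To see the proportional behaviour concretely, here is a minimal sketch. It assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection; the toy data and variable names are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative toy data (an assumption): 10, 20 and 30 rows for classes 1, 2, 3.
y = pd.Series(np.repeat([1, 2, 3], [10, 20, 30]), name="classes")
X = pd.DataFrame({"b": np.arange(60)})

# stratify=y keeps the class proportions in both halves of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

test_counts = y_test.value_counts().sort_index()
print(test_counts)  # class 1: 5, class 2: 10, class 3: 15
```

The test set mirrors the 1:2:3 class ratio rather than holding the same number of rows per class, which is why a custom split is needed here.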

You can build a custom function instead. It is slower than the built-in splitters, though not prohibitively slow in absolute terms. Note that it is written for pandas objects.

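import numpy as np
import pandas as pd

def train_test_eq_split(X, y, n_per_class, random_state=None):
    # A local RandomState avoids reseeding NumPy's global generator and
    # also handles random_state=0, which a plain truthiness check skips.
    rng = np.random.RandomState(random_state)
    # Draw n_per_class rows from each class; the result has a two-level
    # index of (class label, original row label).
    sampled = X.groupby(y, sort=False).apply(
        lambda frame: frame.sample(n_per_class, random_state=rng))
    # Level 1 holds the original row labels of the sampled (test) rows.
    mask = sampled.index.get_level_values(1)

    X_train = X.drop(mask)
    X_test = X.loc[mask]
    y_train = y.drop(mask)
    y_test = y.loc[mask]

    return X_train, X_test, y_train, y_test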

Example case:

data = pd.DataFrame({'classes': np.repeat([1, 2, 3], [10, 20, 30]),
                     'b': np.random.randn(60),
                     'c': np.random.randn(60)})
y = data.pop('classes')

X_train, X_test, y_train, y_test = train_test_eq_split(
    data, y, n_per_class=5, random_state=123)

y_test.value_counts()
# 3    5
# 2    5
# 1    5
# Name: classes, dtype: int64

How it works:

  1. Perform a groupby on X and sample n values from each group.
  2. Get the inner index of this object. This is the index for our test sets, and its set difference with the original data is our train index.
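The two steps can also be sketched directly on the index. This is an alternative sketch, not the function above: it assumes pandas 1.1 or later, where groupby objects have a sample method, and the toy data and the names test_idx and train_idx are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative toy data (an assumption): 10, 20 and 30 rows for classes 1, 2, 3.
rng = np.random.default_rng(0)
X = pd.DataFrame({"b": rng.standard_normal(60),
                  "c": rng.standard_normal(60)})
y = pd.Series(np.repeat([1, 2, 3], [10, 20, 30]), name="classes")

# Step 1: sample 5 labels from each class; the result keeps the original index.
test_idx = y.groupby(y).sample(n=5, random_state=123).index

# Step 2: the train index is the set difference with the full index.
train_idx = X.index.difference(test_idx)

X_train, X_test = X.loc[train_idx], X.loc[test_idx]
y_train, y_test = y.loc[train_idx], y.loc[test_idx]
```

Because groupby sampling returns rows under their original labels, there is no inner index level to extract in this variant.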
