
How To Make Pipeline For Multiple Dataframe Columns?

I have a DataFrame which can be simplified to this:

import pandas as pd

df = pd.DataFrame([{'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
                   {'titl…

Solution 1:

Take a look at the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

The key parameter accepts a pandas DataFrame column label. Within a pipeline it can be used like this:

('tfidf_word', Pipeline([
    ('selector', ItemSelector(key='column_name')),
    ('tfidf', TfidfVectorizer()),
]))
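Pulling the pieces together, a runnable end-to-end sketch might look like the following. The column names come from the question; the two-row DataFrame and the LogisticRegression classifier are stand-ins chosen here just for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single DataFrame column by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
    {'title': 'spiderman', 'text': 'spider man spider', 'url': 'spiderman.com', 'label': 0},
])

# One ItemSelector + vectorizer branch per text column, joined by a FeatureUnion.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf_text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('tfidf', TfidfVectorizer()),
        ])),
        ('tfidf_title', Pipeline([
            ('selector', ItemSelector(key='title')),
            ('tfidf', TfidfVectorizer()),
        ])),
    ])),
    ('clf', LogisticRegression()),
])

pipeline.fit(df, df['label'])
preds = pipeline.predict(df)
```

The whole DataFrame is passed to `fit`; each branch's ItemSelector picks out its own column before vectorizing.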

Solution 2:

@elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

First off I would say you need to define your FunctionTransformer functions such that they can handle and return your input data properly. In this case I assume you just want to pass the DataFrame, but ensure that you get back a properly shaped array for use downstream. Therefore I would propose passing just the DataFrame and accessing by column name. Like so:

def text(X):
    return X.text.values

def title(X):
    return X.title.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])

Now, to test variations of transformers and classifiers, I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a grid search.

tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()

transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]

best_clf = None
best_score = 0
for tran1 in transformers:
    for tran2 in transformers:
        pipe1 = Pipeline(pipe_text.steps + [tran1])
        pipe2 = Pipeline(pipe_title.steps + [tran2])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        X = union.fit_transform(df)
        X_train, X_test, y_train, y_test = train_test_split(X, df.label)
        for clf in clfs:
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            if score > best_score:
                best_score = score
                best_clf = clf

This is a simple example, but you can see how you could plug in any variety of transformations and classifiers in this way.
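The same search can be expressed with scikit-learn's GridSearchCV, which also handles cross-validation and refitting for you: whole pipeline steps can be swapped through the parameter grid. A sketch under the same assumptions as above (the selector functions and column names from the question; the six-row DataFrame is made up for illustration):

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV

def text(X):
    return X.text.values

def title(X):
    return X.title.values

union = FeatureUnion([
    ('text', Pipeline([('col', FunctionTransformer(text, validate=False)),
                       ('vec', TfidfVectorizer())])),
    ('title', Pipeline([('col', FunctionTransformer(title, validate=False)),
                        ('vec', TfidfVectorizer())])),
])

pipe = Pipeline([('union', union), ('clf', LogisticRegression())])

# Swap entire steps via the grid: one vectorizer per column, plus the classifier.
param_grid = {
    'union__text__vec': [TfidfVectorizer(), CountVectorizer()],
    'union__title__vec': [TfidfVectorizer(), CountVectorizer()],
    'clf': [LogisticRegression(), RidgeClassifier()],
}

df = pd.DataFrame({
    'text': ['man bat man', 'bat man', 'spider man web',
             'web spider', 'bat cave bat', 'spider web web'],
    'title': ['batman', 'batman', 'spiderman',
              'spiderman', 'batman', 'spiderman'],
    'label': [1, 1, 0, 0, 1, 0],
})

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(df, df['label'])
```

Unlike the manual loop, every candidate is scored on the same cross-validation folds, so the comparison between classifiers is fairer.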

Solution 3:

I would use a combination of FunctionTransformer to select only certain columns, and then FeatureUnion to combine TFIDF, word count, etc features on each column. There may be a slightly cleaner way, but I think you'll end up with some sort of FeatureUnion and Pipeline nesting regardless.

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# pipeline to get all tfidf and word count for first column
pipeline_one = Pipeline([
    ('column_selection', FunctionTransformer(first_column, validate=False)),
    ('feature-extractors', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('counts', CountVectorizer()),
    ])),
])

# Then a second pipeline to do the same for the second column
pipeline_two = Pipeline([
    ('column_selection', FunctionTransformer(second_column, validate=False)),
    ('feature-extractors', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('counts', CountVectorizer()),
    ])),
])


# Then you would again feature union these pipelines
# to get different feature selection for each column
final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                  ('second-column-feature', pipeline_two)])

# Your dataframe includes your target, so make sure to drop it first
y = df['label']
df = df.drop('label', axis=1)

# Now fit transform should work
final_transformer.fit_transform(df)

If you don't want to apply multiple transformers to each column (tfidf and counts both likely won't be useful), then you can cut down on the nesting by one step.
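For example, keeping only TF-IDF per column removes the inner FeatureUnion entirely. A sketch reusing the selector functions from above, with two toy rows standing in for the question's data:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# One level of nesting less: each branch is selector -> single vectorizer.
final_transformer = FeatureUnion([
    ('first-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(first_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('second-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(second_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
])

df = pd.DataFrame([{'title': 'batman', 'text': 'man bat man bat'},
                   {'title': 'spiderman', 'text': 'spider man web'}])
X = final_transformer.fit_transform(df)
# 2 title terms (batman, spiderman) + 4 text terms (man, bat, spider, web)
print(X.shape)  # (2, 6)
```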
