
How To Make Pipeline For Multiple Dataframe Columns?

I have a DataFrame which can be simplified to this:

import pandas as pd

df = pd.DataFrame([{'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
                   {'titl…

Solution 1:

Take a look at the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

The key parameter accepts a pandas DataFrame column label. Within a pipeline it can be used like this:

('tfidf_word', Pipeline([
    ('selector', ItemSelector(key='column_name')),
    ('tfidf', TfidfVectorizer()),
]))
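Pulling the pieces together, a runnable end-to-end sketch might look like the following. The column names come from the question; the two-row DataFrame and the LogisticRegression classifier are stand-ins chosen here just for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single DataFrame column by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
    {'title': 'spiderman', 'text': 'spider man spider', 'url': 'spiderman.com', 'label': 0},
])

# One ItemSelector + vectorizer branch per text column, joined by a FeatureUnion.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf_text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('tfidf', TfidfVectorizer()),
        ])),
        ('tfidf_title', Pipeline([
            ('selector', ItemSelector(key='title')),
            ('tfidf', TfidfVectorizer()),
        ])),
    ])),
    ('clf', LogisticRegression()),
])

pipeline.fit(df, df['label'])
preds = pipeline.predict(df)
```

The whole DataFrame is passed to `fit`; each branch's ItemSelector picks out its own column before vectorizing.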

Solution 2:

@elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

First off I would say you need to define your FunctionTransformer functions such that they can handle and return your input data properly. In this case I assume you just want to pass the DataFrame, but ensure that you get back a properly shaped array for use downstream. Therefore I would propose passing just the DataFrame and accessing by column name. Like so:

def text(X):
    return X.text.values

def title(X):
    return X.title.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])

Now, to test variations of transformers and classifiers, I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a grid search.

tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()

transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]

best_clf = None
best_score = 0
for tran1 in transformers:
    for tran2 in transformers:
        pipe1 = Pipeline(pipe_text.steps + [tran1])
        pipe2 = Pipeline(pipe_title.steps + [tran2])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        X = union.fit_transform(df)
        X_train, X_test, y_train, y_test = train_test_split(X, df.label)
        for clf in clfs:
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            if score > best_score:
                best_score = score
                best_clf = clf

This is a simple example, but you can see how you could plug in any variety of transformations and classifiers in this way.
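The same search can be expressed with scikit-learn's GridSearchCV, which also handles cross-validation and refitting for you: whole pipeline steps can be swapped through the parameter grid. A sketch under the same assumptions as above (the selector functions and column names from the question; the six-row DataFrame is made up for illustration):

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV

def text(X):
    return X.text.values

def title(X):
    return X.title.values

union = FeatureUnion([
    ('text', Pipeline([('col', FunctionTransformer(text, validate=False)),
                       ('vec', TfidfVectorizer())])),
    ('title', Pipeline([('col', FunctionTransformer(title, validate=False)),
                        ('vec', TfidfVectorizer())])),
])

pipe = Pipeline([('union', union), ('clf', LogisticRegression())])

# Swap entire steps via the grid: one vectorizer per column, plus the classifier.
param_grid = {
    'union__text__vec': [TfidfVectorizer(), CountVectorizer()],
    'union__title__vec': [TfidfVectorizer(), CountVectorizer()],
    'clf': [LogisticRegression(), RidgeClassifier()],
}

df = pd.DataFrame({
    'text': ['man bat man', 'bat man', 'spider man web',
             'web spider', 'bat cave bat', 'spider web web'],
    'title': ['batman', 'batman', 'spiderman',
              'spiderman', 'batman', 'spiderman'],
    'label': [1, 1, 0, 0, 1, 0],
})

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(df, df['label'])
```

Unlike the manual loop, every candidate is scored on the same cross-validation folds, so the comparison between classifiers is fairer.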

Solution 3:

I would use a combination of FunctionTransformer to select only certain columns, and then FeatureUnion to combine TFIDF, word count, etc features on each column. There may be a slightly cleaner way, but I think you'll end up with some sort of FeatureUnion and Pipeline nesting regardless.

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# pipeline to get all tfidf and word count for first column
pipeline_one = Pipeline([
    ('column_selection', FunctionTransformer(first_column, validate=False)),
    ('feature-extractors', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('counts', CountVectorizer()),
    ])),
])

# Then a second pipeline to do the same for the second column
pipeline_two = Pipeline([
    ('column_selection', FunctionTransformer(second_column, validate=False)),
    ('feature-extractors', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('counts', CountVectorizer()),
    ])),
])


# Then you would again feature union these pipelines
# to get different feature selection for each column
final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                  ('second-column-feature', pipeline_two)])

# Your dataframe includes your target, so make sure to drop it first
y = df['label']
df = df.drop('label', axis=1)

# Now fit transform should work
final_transformer.fit_transform(df)

If you don't want to apply multiple transformers to each column (tfidf and counts both likely won't be useful), then you can cut down on the nesting by one step.
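For example, keeping only TF-IDF per column removes the inner FeatureUnion entirely. A sketch reusing the selector functions from above, with two toy rows standing in for the question's data:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# One level of nesting less: each branch is selector -> single vectorizer.
final_transformer = FeatureUnion([
    ('first-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(first_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('second-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(second_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
])

df = pd.DataFrame([{'title': 'batman', 'text': 'man bat man bat'},
                   {'title': 'spiderman', 'text': 'spider man web'}])
X = final_transformer.fit_transform(df)
# 2 title terms (batman, spiderman) + 4 text terms (man, bat, spider, web)
print(X.shape)  # (2, 6)
```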
