Compare Multiple Algorithms With Sklearn Pipeline
Solution 1:
Improving on Bruno's answer, what most people really want to do is be able to pass in ANY classifier (not have to hard-code each one) and also any parameters for each classifier. Here is an easy way to do this:
Create a switcher class that works for any estimator
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier  # needed for the default estimator

class ClfSwitcher(BaseEstimator):

    def __init__(
        self,
        estimator = SGDClassifier(),
    ):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:
Perform hyper-parameter optimization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # 'log_loss' was named 'log' before scikit-learn 1.1
        'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
# train_data / train_labels: your raw documents and their labels
gscv.fit(train_data, train_labels)
How to interpret clf__estimator__loss
clf__estimator__loss is interpreted as the loss parameter for whatever estimator is, where estimator = SGDClassifier() in the topmost example and is itself a parameter of clf, which is a ClfSwitcher object.
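The way a double-underscore name is routed can be checked without running a search. The following is a minimal sketch, assuming a trimmed copy of the ClfSwitcher class from above (repeated so the snippet runs on its own); it relies only on the set_params method that ClfSwitcher inherits from BaseEstimator.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Trimmed copy of the switcher: just enough for this check."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)


pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher(estimator=SGDClassifier())),
])

# 'clf' -> the ClfSwitcher step, 'estimator' -> its wrapped classifier,
# 'loss' -> a parameter of that wrapped classifier.
pipeline.set_params(clf__estimator__loss='modified_huber')
wrapped = pipeline.named_steps['clf'].estimator
print(type(wrapped).__name__, wrapped.loss)

# Swapping the estimator and setting a parameter on the new one in one call.
pipeline.set_params(clf__estimator=MultinomialNB(), clf__estimator__alpha=0.1)
swapped = pipeline.named_steps['clf'].estimator
print(type(swapped).__name__, swapped.alpha)
```

GridSearchCV applies each candidate from the parameter grid through this same set_params mechanism, which is why the estimator itself and its own parameters can sit side by side in one grid entry.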
Solution 2:
Preprocessing
You say that preprocessing the data is very slow, so I assume that you consider the TF-IDF Vectorization part of your preprocessing.
You could preprocess just once.
from sklearn.feature_extraction.text import TfidfVectorizer

X = <your original data>
X = TfidfVectorizer().fit_transform(X)
Once you have your new transformed data, you can continue using it and choose the best classifier.
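As a rough sketch of "transform once, then compare", the snippet below vectorizes a small corpus a single time and cross-validates two classifiers on the result. The corpus, labels, and variable names are hypothetical placeholders, and cv=2 is chosen only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; labels alternate so every fold sees both classes.
docs = [
    "cheap pills buy now", "meeting notes for monday",
    "win money fast now", "project status and notes",
    "buy cheap watches now", "monday lunch with the team",
    "fast money win prize", "team meeting project plan",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Vectorize once, outside any pipeline.
X_tfidf = TfidfVectorizer().fit_transform(docs)

candidates = {
    'SGDClassifier': SGDClassifier(random_state=0),
    'MultinomialNB': MultinomialNB(),
}

# Mean cross-validated accuracy per classifier on the same transformed data.
mean_scores = {
    name: cross_val_score(clf, X_tfidf, labels, cv=2).mean()
    for name, clf in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_name)
```

Because the vectorizer is fitted on all documents up front, the IDF weights have already seen the validation folds, so the scores are best read as a quick comparison between classifiers.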
Optimizing the TF-IDF Transformer
While you could transform your data with TfidfVectorizer just once, I would not recommend it, because the TfidfVectorizer has hyper-parameters itself, which can also be optimized. In the end, you want to optimize the whole Pipeline together, because the parameters for the TfidfVectorizer in a Pipeline [TfidfVectorizer, SGDClassifier] can be different than for a Pipeline [TfidfVectorizer, MultinomialNB].
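One way to see this is to run a separate search per pipeline and compare the vectorizer settings each one prefers. This is a minimal sketch on a hypothetical toy corpus; the grid values and variable names are illustrative assumptions, and on real data the two pipelines may or may not end up with different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; labels alternate so every fold sees both classes.
docs = [
    "cheap pills buy now", "meeting notes for monday",
    "win money fast now", "project status and notes",
    "buy cheap watches now", "monday lunch with the team",
    "fast money win prize", "team meeting project plan",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Only vectorizer hyper-parameters are searched here.
tfidf_grid = {
    'tfidf__stop_words': ['english', None],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
}

best_tfidf_params = {}
for clf in (SGDClassifier(random_state=0), MultinomialNB()):
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    search = GridSearchCV(pipe, tfidf_grid, cv=2)
    search.fit(docs, labels)  # the vectorizer is refit inside every fold
    best_tfidf_params[type(clf).__name__] = search.best_params_

for name, params in best_tfidf_params.items():
    print(name, params)
```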
Creating a custom classifier
To give an answer to what you asked exactly, you could make your own estimator that has the choice of model as a hyper-parameter.
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# ClassifierMixin supplies the score() method that GridSearchCV needs
# when no scoring argument is given.
class MyClassifier(ClassifierMixin, BaseEstimator):

    def __init__(self, classifier_type: str = 'SGDClassifier'):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param classifier_type: string - The switch for different classifiers
        """
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier()
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')
        self.classifier_.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.classifier_.predict(X)
You can then use this custom classifier in your Pipeline.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MyClassifier())
])
You can then use GridSearchCV to choose the best model. When you create a parameter space, you can use a double underscore to specify the hyper-parameter of a step in your pipeline.
from sklearn.model_selection import GridSearchCV

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB']
}

search = GridSearchCV(pipeline, parameter_space, n_jobs=-1, cv=5)
search.fit(X, y)
print('Best model:\n', search.best_params_)
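Beyond best_params_, the per-candidate scores in cv_results_ show how the classifiers compare with each other. The sketch below is one way to read them, run on a hypothetical toy corpus. It assumes a variant of MyClassifier that also inherits from ClassifierMixin, which supplies the score method GridSearchCV looks for when no scoring argument is passed; the random_state value and cv=2 are likewise choices made for the example.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class MyClassifier(ClassifierMixin, BaseEstimator):
    """Switches between classifiers via a string hyper-parameter."""

    def __init__(self, classifier_type='SGDClassifier'):
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier(random_state=0)
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')
        self.classifier_.fit(X, y)
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X):
        return self.classifier_.predict(X)


# Hypothetical toy corpus; labels alternate so every fold sees both classes.
docs = [
    "cheap pills buy now", "meeting notes for monday",
    "win money fast now", "project status and notes",
    "buy cheap watches now", "monday lunch with the team",
    "fast money win prize", "team meeting project plan",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MyClassifier())])
parameter_space = {'clf__classifier_type': ['SGDClassifier', 'MultinomialNB']}

search = GridSearchCV(pipeline, parameter_space, cv=2)
search.fit(docs, labels)

# One mean cross-validated score per candidate classifier.
results = search.cv_results_
mean_by_type = {
    params['clf__classifier_type']: float(score)
    for params, score in zip(results['params'], results['mean_test_score'])
}
for name, score in mean_by_type.items():
    print(f"{name}: {score:.3f}")
print('Best model:', search.best_params_)
```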