
What Is the Difference Between fit_transform and transform in sklearn CountVectorizer?

I was recently practicing with the Kaggle "Bag of Words" introduction and want to clear up a few things: using vectorizer.fit_transform( " * on the list of *cleaned* reviews* " ) Now when we w

Solution 1:

You do not call fit_transform on the test data because, when you fit a Random Forest, it learns its classification rules from the values of the features you give it. If those rules are to be applied to the test set, the test features must be calculated in the same way, using the same vocabulary. If the training and test features are built from different vocabularies, the test features will not make sense to the model, because they reflect a vocabulary different from the one the model was trained on.

For CountVectorizer specifically, consider the following example. Let your training data consist of these 3 sentences:

  1. Dog is black.
  2. Sky is blue.
  3. Dog is dancing.

The vocabulary for this training set is {dog, is, black, sky, blue, dancing} (CountVectorizer lowercases tokens by default). The Random Forest you train will learn rules based on the counts of these 6 vocabulary terms, so each document is represented by a feature vector of length 6. Now suppose the test set is as follows:

  1. Dog is white.
  2. Sky is black.

If you call fit_transform on the test data, the vocabulary becomes {dog, white, is, sky, black}, and each test document is represented by a vector of length 5 holding the counts of these terms. That is like comparing apples with oranges: the rules were learned for counts over the training vocabulary and cannot be applied to counts over a different one. This is why you fit only on the training data and call transform on the test data. With transform, each test document is still a vector of length 6 over the training vocabulary, and a word that vocabulary has never seen, such as "white", is simply ignored.
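
The example above can be reproduced with a short sketch. It assumes scikit-learn is installed and uses CountVectorizer with its default settings; the variable names (train_docs, test_docs, refit) are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["Dog is black.", "Sky is blue.", "Dog is dancing."]
test_docs = ["Dog is white.", "Sky is black."]

# Learn the vocabulary from the training data and vectorize it.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
train_vocab = sorted(vectorizer.vocabulary_)

# Reuse the training vocabulary for the test data: same 6 columns.
X_test = vectorizer.transform(test_docs)

# For contrast only: refitting on the test data yields a different vocabulary.
refit = CountVectorizer()
X_test_refit = refit.fit_transform(test_docs)
refit_vocab = sorted(refit.vocabulary_)

print(train_vocab)                  # columns the model was trained on
print(X_train.shape, X_test.shape)  # both have 6 columns
print(X_test.toarray())             # "white" is not counted anywhere
print(refit_vocab, X_test_refit.shape)  # 5 columns with a different meaning
```

The matrix from transform lines up column by column with the training matrix, so a classifier trained on X_train can consume X_test directly. The refitted matrix has 5 columns whose positions mean something else, so it cannot be passed to that classifier.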

Solution 2:

You split the data into train and test sets so that only the training data is exposed to the model and to any statistics computed during preprocessing, such as means and standard deviations. If the test data is exposed as well, the test score no longer tells you how well the model generalizes, and you risk overfitting. So call fit_transform on the training data only, and apply the statistics learned there to the test data with transform.
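
The same idea for means and standard deviations can be sketched with StandardScaler. This answer does not name a specific transformer, so StandardScaler and the toy numbers below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, made up for this sketch.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()

# fit_transform: compute mean and standard deviation from the training data,
# then scale the training data with them.
X_train_scaled = scaler.fit_transform(X_train)

# transform: scale the test data with the *training* mean and standard
# deviation; nothing is learned from the test set.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The scaled test values do not have zero mean, which is expected: they are expressed relative to the training distribution, which is the one the model has seen.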

Solution 3:

In short, fit learns parameters from the data (for CountVectorizer the vocabulary, for a scaler the normalization statistics). Once fitted, you apply those parameters with transform.

So you can call fit_transform on the test data, but it is not a wise decision: you duplicate effort (the transformer is already fitted on the train data), and the resulting features no longer match the ones the model was trained on, which may lower performance.
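
To make the split between fit and transform concrete, here is a deliberately simplified, pure-Python sketch. TinyCountVectorizer is a hypothetical class written for this illustration; it is not scikit-learn's implementation and only mimics the default lowercasing and the two-or-more-character token rule.

```python
import re


class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for illustration only."""

    def _tokenize(self, text):
        # Lowercase and keep word tokens of at least two characters.
        return re.findall(r"\w\w+", text.lower())

    def fit(self, docs):
        # fit: learn the vocabulary (term -> column index) from the data.
        terms = sorted({tok for doc in docs for tok in self._tokenize(doc)})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs):
        # transform: count terms using the already learned vocabulary;
        # terms that were not seen during fit are skipped.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in self._tokenize(doc):
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        # fit_transform: fit, then transform the same data.
        return self.fit(docs).transform(docs)


vec = TinyCountVectorizer()
train_rows = vec.fit_transform(["Dog is black.", "Sky is blue.", "Dog is dancing."])
test_rows = vec.transform(["Dog is white.", "Sky is black."])
print(train_rows)
print(test_rows)
```

Calling fit_transform again on the test sentences would overwrite vocabulary_, which is exactly the mismatch the answers above warn about.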
