Choosing Between Imputation Methods
Solution 1:
1. Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values in that column; each column is treated separately and independently of the others. It can only be used with numeric data.
Pros: Easy and fast. Works well with small numerical datasets.
Cons: Doesn’t factor in the correlations between features; it only works at the column level. It will give poor results on encoded categorical features (do NOT use it on categorical features). Not very accurate, and it doesn’t account for the uncertainty in the imputations.
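For illustration, here is a minimal sketch of mean imputation using scikit-learn's SimpleImputer (the tiny DataFrame is hypothetical; pass strategy='median' for median imputation instead):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical numeric column with missing values
df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0, np.nan, 70.0]})

# fit on the non-missing values, then replace each NaN with the column mean
mean_imputer = SimpleImputer(strategy='mean')
df['LotFrontage'] = mean_imputer.fit_transform(df[['LotFrontage']]).ravel()
print(df)  # each NaN becomes (65 + 80 + 70) / 3 ≈ 71.67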
2. Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy for imputing missing values, and yes, it works with categorical features (strings or numerical representations): it replaces missing data with the most frequent value within each column.
Pros: Works well with categorical features.
Cons: It also doesn’t factor in the correlations between features, and it can introduce bias into the data.
Zero or Constant imputation, as the name suggests, replaces the missing values with zero or with any constant value you specify.
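Both strategies are also available through SimpleImputer; here is a sketch with made-up columns for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical categorical column with missing entries
df = pd.DataFrame({'GarageType': ['Attchd', np.nan, 'Detchd', 'Attchd', np.nan]})

# most_frequent: each NaN becomes the column mode ('Attchd' here)
mode_imputer = SimpleImputer(strategy='most_frequent')
df['GarageType'] = mode_imputer.fit_transform(df[['GarageType']]).ravel()

# constant: each NaN becomes whatever fill_value you specify (here, zero)
df2 = pd.DataFrame({'MasVnrArea': [198.0, np.nan, 0.0, np.nan, 350.0]})
const_imputer = SimpleImputer(strategy='constant', fill_value=0)
df2['MasVnrArea'] = const_imputer.fit_transform(df2[['MasVnrArea']]).ravel()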
3. Imputation Using k-NN:
k-nearest neighbours (k-NN) is a simple algorithm used for classification and regression. The algorithm uses ‘feature similarity’ to predict the values of any new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for imputing missing values: find the k closest neighbours of the observation with missing data, then impute it based on the non-missing values in that neighbourhood.
How does it work? It creates a basic mean impute, uses the resulting complete matrix to construct a KDTree, then queries that KDTree for the nearest neighbours (NN). After it finds the k nearest neighbours, it takes their weighted average (a sketch of these steps follows the pros and cons below).
Pros: Can be much more accurate than the mean, median, or most-frequent imputation methods (it depends on the dataset).
Cons: Computationally expensive, since k-NN works by storing the whole training dataset in memory. k-NN is also quite sensitive to outliers in the data (unlike SVM).
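To make those steps concrete, here is a simplified sketch of the same pipeline using SciPy's cKDTree; this is my own illustration of the description above, not impyute's actual code, and it skips edge cases (ties, all-NaN columns):

import numpy as np
from scipy.spatial import cKDTree

def knn_impute(X, k=3):
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # step 1: basic mean impute to get a complete matrix
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    # step 2: build a KDTree on the completed rows
    tree = cKDTree(filled)
    # step 3: for each incomplete row, find its k nearest neighbours
    for i in np.unique(np.where(missing)[0]):
        dist, idx = tree.query(filled[i], k=k + 1)   # +1 because the row matches itself
        weights = 1.0 / np.maximum(dist[1:], 1e-12)  # inverse-distance weights
        # step 4: impute each missing cell as the weighted average of the neighbours
        for j in np.where(missing[i])[0]:
            X[i, j] = np.average(filled[idx[1:], j], weights=weights)
    return X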
Since the outlier ratio is low, we can use method 3. It will also have less impact on the correlation between the imputed target variable (i.e. LotFrontage) and the other features.
import sys
from impyute.imputation.cs import fast_knn

sys.setrecursionlimit(100000)  # increase the recursion limit so the KDTree build doesn't overflow
# fast_knn expects a float array and returns the whole imputed matrix,
# so keep only the LotFrontage column (index 0)
imputed = fast_knn(train_df[['LotFrontage', '1stFlrSF', 'MSSubClass']].values, k=30)
train_df['LotFrontage'] = imputed[:, 0]
I've chosen the two additional features (1stFlrSF and MSSubClass) based on their correlation with the LotFrontage column.
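If you prefer to stay within scikit-learn, KNNImputer implements a very similar idea (k-nearest-neighbour averaging with optional inverse-distance weighting); this is a possible equivalent of the call above, not the original code:

from sklearn.impute import KNNImputer

cols = ['LotFrontage', '1stFlrSF', 'MSSubClass']
# n_neighbors mirrors k=30 above; weights='distance' favours closer rows
knn_imputer = KNNImputer(n_neighbors=30, weights='distance')
train_df[cols] = knn_imputer.fit_transform(train_df[cols])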