top of page

Feature Selection- Selection of the best that matters

In Machine learning we want our model to be optimized and fast in order to do so and to eliminate unnecessary variables we employ various feature selection techniques.

Top reasons to use feature selection are:

  • To train the machine learning model faster.

  • To improve the accuracy of a model, if the optimized subset is chosen.

  • To reduce the complexity of a model.

  • To reduce overfitting and make it easier to interpret.

Feature selection techniques that are easy to use and also gives good results are:

A) Filter Methods

  1. Dropping constant features

  2. Univariate Selection

  3. Feature Importance

  4. Correlation Matrix with Heat map

Dropping constant features

In this filter we can remove the features which have constant values, which are actually unimportant to solve the problem statement.

In Python the code to apply this using VarianceThreshhold feature of sklearn is:

from sklearn.feature_selection import VarianceThreshold
constant_columns = [column for column in data.columns
if column not in data.columns[var_thres.get_support()]]

Univariate Selection

In this type of selection Statistical tests are used to select those variables/features which have the strongest relationship with the result/output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

  • Pearson’s Correlation Coefficient: f_regression()

  • ANOVA: f_classif()

  • Chi-Squared: chi2()

The example below uses the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features from the Mobile Price Range Prediction Dataset

To understand Chi2 we need to understand few terms as below:

Degrees of freedom:

Degrees of freedom refers to the maximum number of logically independent values, which have the freedom to vary. In simple words, it can be defined as the total number of observations minus the number of independent constraints imposed on the observations.

A chi-square test is used in statistics to test the independence of two events. Given the data of two variables, we can get observed count O and expected count E. Chi-Square measures how expected count E and observed count O deviates each other.

Xc2= ∑(Oi - Ei )2 / Ei


c= degree of freedom,

O= Observed Value(s)

E=Expected value(s)

When two features are independent, the observed count is close to the expected count, thus we will have smaller Chi-Square value. So high Chi-Square value indicates that the hypothesis of independence is incorrect. In simple words, higher the Chi-Square value the feature is more dependent on the response and it can be selected for model training.

In Python this type of filtration can be done by using SelectKBest and chi2 library from sklearn as in following code:

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("train.csv")
X = data.iloc[:,0:20#independent columns
y = data.iloc[:,-1]    #target column i.e price range

 #apply SelectKBest class to extract top 10 best features 
bestfeatures = SelectKBest(score_func=chi2, k=10) 
fit =,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

 #concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1) featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features

Feature Importance

Feature importance is a kind of score for each feature in dataset, the higher the score more important or relevant is the feature towards for the result or dependent variable.

You can get the feature importance of each feature of your dataset by using the feature importance property of the model.

Feature importance is an inbuilt class that comes with Tree Based Classifiers, we will be using Extra Tree Classifier for extracting the top 10 features for the dataset.

In python it can be done by following code:

 from sklearn.ensemble import ExtraTreesClassifier 
import matplotlib.pyplot as plt 
model = ExtraTreesClassifier(),y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization feat_importances = pd.Series(model.feature_importances_, index=X.columns) feat_importances.nlargest(10).plot(kind='barh')

Correlation Matrix with Heatmap

Correlation states the relation between variables and the output or target variable.

Correlation can be positive (directly proportional) or negative (inversely proportional)

Heat map makes it easy to identify which features are most related to the target variable, we will plot heat map of correlated features using the seaborn library.

In Python it can be done by following code.

import seaborn as sns
#get correlations of each features in dataset
corrmat = data.corr()
top_corr_features = corrmat.index
#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True ,

The dark color indicates higher correlation.

B) Wrapper Methods

Some common examples of wrapper methods are:

1) Forward feature selection

2) Backward feature elimination

3) Recursive feature elimination.

1: Forward Selection: Forward selection is an iterative method of optimization of feature selection in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.

In python it can be implemented as:

1. # step forward feature selection
3. from sklearn.model_selection import train_test_split
4. from sklearn.ensemble import RandomForestRegressor
5. from sklearn.metrics import r2_score
6. from mlxtend.feature_selection import SequentialFeatureSelector as SFS
7. # select numerical columns:
9. numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
10. numerical_vars = list(data.select_dtypes(include=numerics).columns)
11. data = data[numerical_vars]
12. # separate train and test sets
13. X_train, X_test, y_train, y_test = train_test_split(
14.  X,Y, test_size=0.3, random_state=0)
15. # find and remove correlated features
16. def correlation(dataset, threshold):
17.  col_corr = set() # Set of all the names of correlated columns
18.  corr_matrix = dataset.corr()
19.  for i in range(len(corr_matrix.columns)):
20.  for j in range(i):
21.  if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
22.  colname = corr_matrix.columns[i] # getting the name of column
23.  col_corr.add(colname)
24.  return col_corr
26. corr_features = correlation(X_train, 0.8)
27. print('correlated features: ', len(set(corr_features)) )
28. # removed correlated features
29. X_train.drop(labels=corr_features, axis=1, inplace=True)
30. X_test.drop(labels=corr_features, axis=1, inplace=True)
31. X_train.fillna(0, inplace=True)
34. # step forward feature selection
36. from mlxtend.feature_selection import SequentialFeatureSelector as SFS
38. sfs1 = SFS(RandomForestRegressor(), 
39.  k_features=10, 
40.  forward=True, 
41.  floating=False, 
42.  verbose=2,
43.  scoring='r2',
44.  cv=3)
46. sfs1 =, y_train)
47. X_train.columns[list(sfs1.k_feature_idx_)]
48. sfs1.k_feature_idx_

2: Backward Elimination: In backward elimination, we start with all the features and removes the most insignificant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

In python it can be implemented as:

49. # step backward feature elimination
51. sfs1 = SFS(RandomForestRegressor(), 
52.  k_features=10, 
53.  forward=False, 
54.  floating=False, 
55.  verbose=2,
56.  scoring='r2',
57.  cv=3)
59. sfs1 =, y_train)
60. X_train.columns[list(sfs1.k_feature_idx_)]
61. sfs1.k_feature_idx_

3: Recursive Feature elimination: It is the most greedy optimization algorithm which aims to find the best performing feature subset.

It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the left features until all the features are done with. It then ranks the features based on the order of their elimination.

In python it can be implemented as:

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt
# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y =
# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1), y)

C) Embedded Methods

1: LASSO Regression

Lasso regression performs L1 regularization which adds penalty equivalent to absolute value of the magnitude of coefficients.

Regularisation consists of adding a penalty to the different parameters of the machine learning model to reduce the freedom of the model and in other words to avoid overfitting. In linear model regularisation, the penalty is applied over the coefficients that multiply each of the predictors. From the different types of regularisation, Lasso or l1 has the property that is able to shrink some of the coefficients to zero. Therefore, that feature can be removed from the model.

In python it is implemented as:

#load libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
# different scales, so it helps the regression to scale them
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X,Y, test_size=0.3,
scaler = StandardScaler()
# to force the algorithm to shrink some coefficients
sel_ = SelectFromModel(Lasso(alpha=100)), y_train)
# make a list with the selected features and print the outputs
selected_feat = X_train.columns[(sel_.get_support())]

We can see that Lasso regularisation helps to remove non-important features from the dataset. So, increasing the penalisation will result in increase the number of features removed.

If the penalty is too high and important features are removed, we will notice a drop in the performance of the algorithm and then realise that we need to decrease the regularisation.

2: Random Forest /Ensemble Techniques

Decision Tree Ensembles, also referred to as random forests, are useful for feature selection in addition to being effective classifiers.

Random forests are one the most popular machine learning algorithms. They are so successful because they provide in general a good predictive performance, low overfitting and easy interpretability.

For classification, the measure of impurity is either the Gini impurity or the information gain/entropy. For regression the measure of impurity is variance. Therefore, when training a tree, it is possible to compute how much each feature decreases the impurity. The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.

To give a better intuition, features that are selected at the top of the trees are in general more important than features that are selected at the end nodes of the trees, as generally the top splits lead to bigger information gains.

In python it can be implemented as:

# Import libraries
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Encode categorical variables
X = pd.get_dummies(X, prefix_sep='_')
y = LabelEncoder().fit_transform(y)
# Normalize feature vector
X2 = StandardScaler().fit_transform(X)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.30, random_state = 0)
# instantiate the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the classifier to the training set, y_train)
# predict on the test set
y_pred = clf.predict(X_test)

Thanks for reading!

1,909 views0 comments

Recent Posts

See All


Evaluat(ă) cu 0 din 5 stele.
Încă nu există evaluări

Adaugă o evaluare
bottom of page