Feature selection in machine learning with Python
An essential step in any data science project is to select the most predictive variables. There are various methods of feature selection. Some scale well but consider only features individually. Some are extremely computationally costly and thus applicable only to relatively small datasets. Some others fall somewhere in the middle.
There are also various implementations of feature selection algorithms in Python, spread across open-source libraries. Covering all of them in one article is almost impossible. Instead, in this blog post, I will introduce the Python libraries for feature selection, highlight which selection methods are available in each of them, and then demo some of the feature selection implementations.
For tutorials and step-by-step code implementations of additional feature selection methods, check out our course Feature Selection for Machine Learning or our book Feature Selection in Machine Learning with Python.
You may also like my talk at DataTalks.Club:
So how can we do feature selection in Python?
Before we jump into the demos, let’s do a short recap of feature selection and the different methods that we can use to find the best feature subsets.
Feature selection methods
In feature selection, we select a subset of features from the data set to train machine learning algorithms. Feature selection techniques differ from dimensionality reduction in that they do not alter the original representation of the variables but merely select a smaller set of features. Remember that methods like PCA create new features by combining the existing ones into the principal components.
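To make the contrast with dimensionality reduction concrete, here is a minimal sketch using the breast cancer dataset from Scikit-learn. The choice of 5 features and 5 components is arbitrary, and both transformers are fitted on the full dataset only to keep the example short. The selector returns a subset of the original columns, whereas PCA returns new columns that combine all the inputs.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Feature selection: keep 5 of the original columns, values untouched
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
selected_names = X.columns[selector.get_support()]
X_selected = X[selected_names]

# Dimensionality reduction: 5 new columns, each one a linear
# combination of all 30 original features
pca = PCA(n_components=5).fit(X)
X_pca = pd.DataFrame(
    pca.transform(X),
    columns=[f"pc_{i}" for i in range(5)],
)

print(list(selected_names))  # names of original features
print(list(X_pca.columns))   # names of the new components
```

The selected columns keep their original names and meaning, which is what makes feature selection easier to interpret than PCA.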
By reducing the number of features, we can improve the performance of machine learning models (for example, by reducing overfitting), shorten training times, and obtain more interpretable models.
Feature selection methods have been traditionally grouped into filter methods, wrapper methods, and embedded methods.
Image taken from Feature Selection in Machine Learning with Python
Filter methods select the best features based on the feature characteristics, ignoring their interaction with the machine learning model. They rank the features and then select the top-ranking ones. Ranking methods normally use statistical tests like chi-square, ANOVA, correlation, and mutual information.
Wrapper methods wrap the search for the most relevant features around a predictive model. They generate multiple feature subsets and then evaluate their performance based on the classifier or regression model. The selected features are those from the subset that returned the best performing model.
Embedded methods “embed” the selection procedure in the training of the predictive model. Lasso and feature importance from decision trees are the classical examples of embedded methods. The coefficients of linear models can also be used to select important features.
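As a rough illustration of the wrapper idea described above, the following sketch runs a forward search with Scikit-learn's SequentialFeatureSelector. The estimator (a scaled logistic regression), the number of features to select (3), and the scoring metric are assumptions chosen only to keep the example fast. They are not recommendations.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model that the search is wrapped around
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward search: start with no features and, at each step, add the
# feature that most improves the cross-validated roc-auc
sfs = SequentialFeatureSelector(
    model,
    n_features_to_select=3,
    direction="forward",
    scoring="roc_auc",
    cv=3,
)
sfs.fit(X_train, y_train)

wrapper_features = list(X_train.columns[sfs.get_support()])
print(wrapper_features)
```

Note how the model is retrained many times during the search. This is why wrapper methods are more costly than filter methods.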
Feature selection and feature engineering are widely used in data science during the preprocessing of the data. So how can we do that in Python?
Python libraries for feature selection
There are 3 Python libraries with feature selection modules: Scikit-learn, MLXtend and Feature-engine.
Scikit-learn contains algorithms for filter methods, wrapper methods and embedded methods, including recursive feature elimination.
MLXtend contains transformers to implement forward, backward and exhaustive search.
Feature-engine contains alternative feature selection methods based on machine learning model performance, feature shuffling and also feature selection techniques that support categorical variables.
Feature selection implementations by the different libraries.
In this article, we will implement various feature selection techniques with Scikit-learn and Feature-engine.
Feature selection with Scikit-learn
Scikit-learn contains algorithms for filter methods, wrapper methods and embedded methods, including recursive feature elimination. Among the filter methods, we can select features using their variance or based on ANOVA. Let’s explore these procedures.
With Scikit-learn, we can remove irrelevant features by looking at feature variability. Features whose standard deviation is zero are constant and can be removed. In this example, we will create a toy dataset with 3 constant variables, and then we will remove them with Scikit-learn.
Let’s begin by creating a toy dataset with 3 constant features:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# Toy dataset with redundant and constant features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_classes=2,
    random_state=10,
)

X = pd.DataFrame(X)

# Add constant features
X[[0, 5, 9]] = 1

In the previous code snippet, we created a toy dataset with the function make_classification from Scikit-learn. You will most likely load your own dataset with pandas.
Now, we will remove the constant features with the VarianceThreshold class from Scikit-learn:

# To remove constant features
sel = VarianceThreshold(threshold=0)

# fit finds the features with zero variance, transform removes them
X_t = sel.fit_transform(X)
In the previous code snippet, we first set up the transformer and then called fit_transform(), which applies the method fit() followed by transform(). The transformer discovered the constant features using fit(), and with transform(), it removed them from the data.
X_t contains predictors whose variability is greater than 0.
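If we want to check which variables were removed, we can inspect the fitted transformer. The sketch below rebuilds the same toy dataset and uses get_support(), which returns a boolean mask over the input columns. The variable names kept and dropped are mine.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# Same toy dataset as above, with 3 constant features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_classes=2,
    random_state=10,
)
X = pd.DataFrame(X)
X[[0, 5, 9]] = 1

sel = VarianceThreshold(threshold=0)
sel.fit(X)

# get_support() is True for the features that are retained
mask = sel.get_support()
kept = X.columns[mask].tolist()
dropped = X.columns[~mask].tolist()

# transform returns a NumPy array, so we restore the column names
X_t = pd.DataFrame(sel.transform(X), columns=kept)

print(dropped)  # [0, 5, 9]
```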
The chi-square test is suitable for selecting categorical variables when the target variable is also categorical. It ranks features based on the p-values returned by the test and then selects the top-ranked features.
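The ranking described above can be sketched with pandas and SciPy: we build a contingency table per feature and pass it to chi2_contingency. The dataset below is made up for illustration. The columns plan and colour are hypothetical, and only plan is generated to depend on the target.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Hypothetical categorical features: 'plan' depends on the target,
# 'colour' is generated independently of it
plan_if_1 = rng.choice(["a", "b"], size=n, p=[0.8, 0.2])
plan_if_0 = rng.choice(["a", "b"], size=n, p=[0.3, 0.7])
plan = np.where(target == 1, plan_if_1, plan_if_0)
colour = rng.choice(["red", "green", "blue"], size=n)

df = pd.DataFrame({"plan": plan, "colour": colour, "target": target})

# One chi-square test per feature, on its contingency table with the target
p_values = {}
for feature in ["plan", "colour"]:
    contingency = pd.crosstab(df[feature], df["target"])
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
    p_values[feature] = p_value

# Smaller p-values rank first
ranking = pd.Series(p_values).sort_values()
print(ranking)
```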
Note that Scikit-learn’s chi-square function does not carry out the intended procedure. This is a known
issue. Instead, use
ANOVA is suitable for selecting continuous variables when the target variable is categorical. Let’s explore how we can select features using ANOVA and Scikit-learn. We will use the breast cancer dataset.
Let’s load the dataset and separate it into a train and a test set:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# load dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Separate data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
I like using the train_test_split function from Scikit-learn to separate the data. You could also use pandas iloc utilizing index values extracted at random.
Now, we will select the 10 features with the lowest p-value for ANOVA:
# Rank and select features
sel = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)

# remove features
X_train_t = sel.transform(X_train)
X_test_t = sel.transform(X_test)
With SelectKBest, we indicate the number of features we want to select through the parameter k. This is an arbitrary value, but it can be optimized.
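One way to optimize the number of features is to treat k as a hyperparameter and tune it with cross-validation. The following is a minimal sketch, assuming a logistic regression as the downstream model and a small, arbitrary grid of candidate values for k. Neither choice comes from the example above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Selection sits inside the pipeline, so it is refitted on each
# training fold and never sees the validation fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values for k (arbitrary grid)
param_grid = {"select__k": [5, 10, 15, 20]}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["select__k"]
print(best_k, round(search.best_score_, 3))
```

Placing the selector inside the pipeline matters: selecting features before cross-validation would leak information from the validation folds into the selection step.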
Lasso can shrink some of the coefficients of a linear model to 0, so it selects features out of the box. Here, I’ll show how to select features with Lasso on a classification and a regression dataset.
Let’s begin by importing the libraries, functions, and classes:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
We will next import the breast cancer dataset from Scikit-learn with the aim of predicting if a tumor is benign or malignant. This is a classification dataset. Next, we will split the data into a training and a testing set:
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Let’s set up the standard scaler from Scikit-learn:
scaler = StandardScaler()
scaler.fit(X_train)
Next, we will select features utilizing logistic regression as a classifier, with the Lasso regularization:
selector = SelectFromModel(
    LogisticRegression(C=0.5, penalty='l1', solver='liblinear', random_state=10)
)

selector.fit(scaler.transform(X_train), y_train)

With selector.get_support(), we obtain a boolean vector with True for the features that have non-zero coefficients:
array([False, True, False, False, False, False, False, True, True, False, True, False, False, False, False, True, False, False, False, True, True, True, True, True, True, False, True, True, True, False])
We can identify the names of the set of features that will be removed like this:
removed_feats = X_train.columns[(selector.estimator_.coef_ == 0).ravel().tolist()]
If we execute removed_feats, we obtain the following pandas Index with the features that will be removed:
Index(['mean radius', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean fractal dimension', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'concavity error', 'concave points error', 'symmetry error', 'worst compactness', 'worst fractal dimension'], dtype='object')
We can remove the features from the training and testing sets like this:
X_train_selected = selector.transform(scaler.transform(X_train))
X_test_selected = selector.transform(scaler.transform(X_test))
If we now execute X_train_selected.shape, X_test_selected.shape, we obtain the shapes of the reduced datasets: ((426, 14), (143, 14)).
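Go ahead and change the value of the regularization parameter C to see how the result changes. In Scikit-learn, C is the inverse of the regularization strength, so smaller values shrink more coefficients to zero. The best value of C, and thus the best feature subset, can be determined with cross-validation.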
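To see the effect of C on the number of selected features, we can repeat the selection over a few values. This is only a sketch: the grid of C values is arbitrary, and the dictionary n_selected is a name I introduce for the example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Count the features with non-zero coefficients for each value of C
n_selected = {}
for C in [0.01, 0.1, 0.5, 1, 10]:
    selector = SelectFromModel(
        LogisticRegression(
            C=C, penalty="l1", solver="liblinear", random_state=10
        )
    )
    selector.fit(X_train_scaled, y_train)
    n_selected[C] = int(selector.get_support().sum())

print(n_selected)
```

Smaller values of C mean stronger regularization, so we should expect fewer features to survive at the low end of the grid.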
Recursive feature elimination
Recursive feature elimination is a sequential process in which one or more features are removed at each iteration, and the importance of the remaining features is reassessed after each elimination. In Scikit-learn, we can implement recursive feature elimination with the RFE or RFECV classes.
Let’s select features recursively using the feature importance derived from random forests. We will use the breast cancer dataset and separate the data into a train and a test set. Features should be selected based on the training data only.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# load dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Separate data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=10)

# Remove 2 features per iteration until 8 features remain
sel_ = RFE(
    clf,
    n_features_to_select=8,
    step=2,
)

sel_.fit(X_train, y_train)

X_train_selected = sel_.transform(X_train)
X_test_selected = sel_.transform(X_test)

The result consists of NumPy arrays with the selected features.
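With RFE, we need to decide in advance how many features to keep. RFECV instead uses cross-validation to choose that number. Below is a minimal sketch on the same data. The values of cv, scoring and min_features_to_select are assumptions for illustration, and we wrap the output in a DataFrame to recover the column names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=10)

# Remove 2 features per iteration and keep the subset size with the
# best cross-validated roc-auc
sel_cv = RFECV(
    clf,
    step=2,
    min_features_to_select=2,
    cv=3,
    scoring="roc_auc",
)
sel_cv.fit(X_train, y_train)

# Names of the retained features
selected = X_train.columns[sel_cv.get_support()].tolist()

X_train_selected = pd.DataFrame(
    sel_cv.transform(X_train), columns=selected, index=X_train.index
)
X_test_selected = pd.DataFrame(
    sel_cv.transform(X_test), columns=selected, index=X_test.index
)

print(sel_cv.n_features_, selected)
```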
Feature selection with Feature-engine
Feature-engine contains many classes to select features based on recursive feature elimination or addition, feature shuffling, population stability index, mean target value, cardinality and more. Check out Feature-engine’s documentation for more details.
Univariate feature selection
Feature-engine includes univariate feature selection methods based either on the mean value of the target per category or bin, or on the performance of a classifier or regressor trained on a single feature.
In single feature model performance, a machine learning model is trained for each feature, with only that feature as input, and the features are ranked based on this model performance.
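To make the mechanics concrete, here is a rough sketch of the procedure using only Scikit-learn. It is not Feature-engine's implementation: the shallow decision tree and the rule of keeping features above the mean performance are illustrative assumptions of mine.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train one model per feature, with only that feature as input,
# and store its mean cross-validated roc-auc
performance = {}
for feature in X_train.columns:
    scores = cross_val_score(
        DecisionTreeClassifier(max_depth=2, random_state=10),
        X_train[[feature]],
        y_train,
        scoring="roc_auc",
        cv=3,
    )
    performance[feature] = scores.mean()

# Rank the features by the performance of their single feature model
performance = pd.Series(performance).sort_values(ascending=False)

# Keep the features that perform above the average (illustrative rule)
threshold = performance.mean()
selected = performance[performance > threshold].index.tolist()

print(performance.head())
print(len(selected))
```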
Let’s select features based on single feature model performance utilizing cross-validation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from feature_engine.selection import SelectBySingleFeaturePerformance

# load dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Separate data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train one random forest per feature and evaluate it with 3-fold cross-validation
sel = SelectBySingleFeaturePerformance(
    estimator=RandomForestClassifier(random_state=10),
    scoring='roc_auc',
    cv=3,
    threshold=None,
)

X_train_t = sel.fit_transform(X_train, y_train)
X_test_t = sel.transform(X_test)
We can do some visualization to explore the importance of the features based on the single feature classifiers:
pd.Series(sel.feature_performance_).sort_values(
    ascending=False).plot.bar(figsize=(10, 5))
plt.ylabel('roc-auc')
plt.title('Univariate performance')
Image taken from Feature Selection in Machine Learning with Python
When training linear models like linear or logistic regression, multicollinearity may affect model performance. Thus, it might be useful to remove correlated features.
Feature-engine contains algorithms that select features based on the feature correlation. One of its transformers finds groups of correlated features and then retains the one with fewer missing data points, higher cardinality or variability, or greater model-derived importance.
How can we find correlated features? We can use the corr() method from pandas and display the resulting correlation matrix as a heatmap:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

sns.set_theme()

# Toy dataset with correlated features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_redundant=7,
    n_classes=2,
    random_state=10,
)

X = pd.DataFrame(X)

# the default correlation method of pandas.corr is pearson
corrmat = X.corr(method='pearson')

# we can make a heatmap with seaborn
sns.heatmap(corrmat, annot=True)
plt.show()
Image taken from Feature Selection in Machine Learning with Python
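Once we have the correlation matrix, a simple way to act on it is to drop one feature from every highly correlated pair. The sketch below does this with pandas and NumPy only. The threshold of 0.8 is an arbitrary choice, and this greedy rule is a simplification rather than the procedure that Feature-engine applies.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Toy dataset with correlated features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_redundant=7,
    n_classes=2,
    random_state=10,
)
X = pd.DataFrame(X)

threshold = 0.8  # arbitrary cut-off on the absolute correlation

# Absolute correlations; keep only the upper triangle so that each
# pair of features is examined once
corr = X.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a feature if it is highly correlated with any earlier column.
# This is greedy and may drop more features than strictly necessary.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

X_reduced = X.drop(columns=to_drop)
print(to_drop, X_reduced.shape)
```

This rule keeps the first feature it encounters in each correlated group, with no regard to which one is more predictive.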
More feature selection in Python
If you want to learn more about feature selection and how to carry it out in Python, check out our course and book:
- course: Feature Selection for Machine Learning
- book: Feature Selection in Machine Learning with Python
Both the course and the book contain a great amount of information regarding:
- the logic of the algorithms
- their advantages and limitations
- code examples using real world datasets
The course and the book are suitable for beginner and intermediate data scientists alike.
Additional reading resources