
Feature Importance vs. Feature Selection: How are they related?
In machine learning, feature selection and feature importance play pivotal roles in constructing accurate and efficient predictive models. These concepts are essential for optimizing model performance, reducing dimensionality, enhancing interpretability, and improving generalization.
In this article, I will delve into the significance of feature importance and feature selection, explain how these terms are related and how they differ, explore various algorithms associated with them, and shed light on their impact on machine learning models.
Feature Importance and Feature Selection: Definitions
To comprehend the essence of feature importance and feature selection, let’s begin by defining these terms. In machine learning, features refer to the individual variables or attributes of a dataset that are utilized to predict the target variable.
Feature importance refers to the degree of influence each feature has on the output or prediction made by a classifier or regression model. It quantifies the relevance or contribution of each feature to the predictive power of the algorithm.
Feature selection, on the other hand, is the process of choosing a subset of features from the original set of input features. The objective of feature selection is to identify the most relevant features that have a significant impact on the model’s performance while eliminating or reducing the impact of redundant or irrelevant features. By reducing the number of features of the dataset, feature selection not only improves computational efficiency but also helps in mitigating overfitting and improving the interpretability of the model.
Feature Importance vs. Feature Selection: How are they related?
Feature selection and feature importance are closely related concepts in machine learning. In fact, most feature selection methods rely on first deriving feature importance and then selecting features based on that importance ranking. The main difference between the two terms is their aim.
Feature importance quantifies the relevance or contribution of each feature to the predictive power of a machine learning model. Its aim is to identify the most important features that significantly impact the model’s performance. Data scientists use feature importance to better understand what drives the model output and also inform clients regarding why a certain decision was made.
Feature selection, on the other hand, is the process of choosing the best subset of features from the original feature set. The goal of feature selection is to select the most relevant features, reducing the number of features that will be used to train the machine learning model. Reducing the number of input features shortens the model's training time and the time needed to obtain a prediction, both of which are crucial when putting models into production.
So, how are these terms related? Feature importance plays a crucial role in guiding the feature selection process by providing insights into which features have the greatest influence on the target variable. In fact, as I mentioned previously, many feature selection algorithms involve deriving the feature importance first, ranking the features, and finally selecting the top-ranking features.
The main difference is the aim of each process. Feature importance helps us understand the relationship or contribution of each feature to a prediction and, hence, how a feature is related to the target variable. Feature selection aims to reduce the number of features that will be used to train machine learning models in order to have more efficient models in production.
Feature selection methods
Let’s explore some popular feature selection techniques that are widely used in data science and machine learning.
- Filter Methods: Filter methods evaluate the statistical properties of features independently of any machine learning algorithm. These methods utilize statistical tests, such as chi-square tests for categorical variables or Pearson correlation coefficients for continuous variables, to assess the relevance of each feature to the target variable. Features are ranked based on the test p-values, and a predetermined threshold is applied to select the best-performing features. In a sense, the p-values provide information about the feature's importance (see the sketch after this list).
- Wrapper Methods: Wrapper methods assess the performance of a machine learning algorithm using different subsets of features. They involve training and evaluating the model with various feature subsets and selecting the subset that yields the best value for a given performance metric. Although wrapper methods can be computationally expensive, they tend to yield more accurate results because they consider interactions between features. Wrapper methods optimize the model's performance without deriving individual feature importance; they are probably the only feature selection methods that do not rely on feature importance to select the best set of features.
- Embedded Methods: Embedded methods, as the name suggests, incorporate feature selection within the machine learning algorithm itself. These methods determine the importance of features during the model's training process. Linear models assign weights or coefficients to each feature, while decision tree-based models calculate importance based on the reduction in impurity attained at each split. Lasso regularization is well known for its ability to select features, as it naturally shrinks some coefficients to zero.
- Other Methods: There are other feature selection methods that don't fall into the above categories. For example, we can rank features based on their importance derived from training single-feature models or by permutation, and then select the top-ranking features based on some criterion.
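To make the filter and embedded categories more concrete, here is a minimal sketch using scikit-learn. The dataset, the scoring function, and the number of features to keep are illustrative choices, not recommendations.

```python
# Minimal sketch: a filter method and an embedded method with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Filter method: rank features with ANOVA F-tests and keep the 10 best.
filter_selector = SelectKBest(score_func=f_classif, k=10)
X_filtered = filter_selector.fit_transform(X, y)

# Embedded method: L1-regularized logistic regression shrinks some
# coefficients to zero; SelectFromModel keeps the features whose
# coefficients remain non-zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, max_iter=5000)
embedded_selector = SelectFromModel(estimator=l1_model)
X_embedded = embedded_selector.fit_transform(X, y)

print(X_filtered.shape, X_embedded.shape)
```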
For more details on feature selection, check out our course Feature Selection for Machine Learning or our book Feature Selection in Machine Learning with Python.
Calculating Feature Importance
There are various methods to infer the feature importance:
- Importance derived from linear models: Linear models like linear regression and logistic regression assign a coefficient to each feature. The magnitude of these coefficients reflects the importance of the corresponding feature, provided the features are on comparable scales: larger coefficients indicate a stronger influence on the target variable.
- Importance derived from tree-based models: In decision tree-based algorithms such as Random Forest and XGBoost, feature importance can be calculated by analyzing the number of times a feature is used to split the data across the decision trees, as well as the degree of reduction in impurity. The more frequently a feature is used for splitting, and the greater the impurity reduction it achieves, the higher its importance score (see the sketch after this list).
- Importance derived from statistical tests: Statistical tests like chi-square, ANOVA, and correlation return a p-value that indicates how compatible the data are with the null hypothesis. The null hypothesis is that there is no difference in the feature distribution across the target classes (for classification), or that the predictor and the target are not correlated (for regression). The smaller the p-value, the more important the feature.
- Permutation feature importance: Permutation feature importance measures the importance of features by shuffling their values and evaluating the impact on the model's performance. The performance degradation caused by permuting each feature determines that feature's importance for that particular machine learning model.
- Importance derived from single-feature classifiers or regressors: This procedure involves training a machine learning model on each feature individually and obtaining a performance metric value for each one. These values reflect the importance of each feature for the classification or regression task at hand.
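As an illustration, the sketch below derives impurity-based importance from a random forest and permutation importance with scikit-learn; the dataset and model settings are arbitrary choices for demonstration.

```python
# Minimal sketch: impurity-based and permutation feature importance with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importance, computed while the trees are grown.
impurity_importance = model.feature_importances_

# Permutation importance: the drop in performance after shuffling each feature.
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

for name, imp, p_imp in zip(X.columns, impurity_importance, perm.importances_mean):
    print(f"{name}: impurity={imp:.3f}, permutation={p_imp:.3f}")
```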
Feature Selection and Feature Importance with Python
Python, along with popular machine learning libraries such as Scikit-learn (Sklearn), provides a wide range of tools and functions to perform feature selection and compute feature importance scores. Scikit-learn offers most, if not all, filter methods out of the box. It also provides embedded methods for feature selection, which rely on the feature importance derived from decision tree-based models or from linear and logistic regression.
Scikit-learn also offers Recursive Feature Elimination (RFE), which recursively removes the least important features, as well as forward selection and backward elimination (two wrapper methods).
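Here is a rough sketch of how these selectors are used; the estimator and the number of features to select are illustrative choices.

```python
# Minimal sketch: RFE and forward selection with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)  # helps the logistic regression converge

estimator = LogisticRegression(max_iter=1000)

# RFE: recursively drops the least important features (based on the model
# coefficients) until 10 features remain.
rfe = RFE(estimator=estimator, n_features_to_select=10)
X_rfe = rfe.fit_transform(X_scaled, y)

# Wrapper method: forward selection adds one feature at a time, keeping the
# addition that most improves cross-validated performance.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=10, direction="forward")
X_sfs = sfs.fit_transform(X_scaled, y)
```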
Feature-engine is an open-source Python library that offers a wide repertoire of feature selection algorithms, including importance and selection derived from single feature classifiers or regressors, recursive feature elimination, and selection by feature permutation.
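A minimal sketch of how these selectors might be used with Feature-engine follows; the class names and parameters reflect recent versions of the library and may differ in yours, so treat this as an illustration rather than a reference.

```python
# Minimal sketch using Feature-engine's selection transformers; class names
# and parameters may vary across library versions.
from feature_engine.selection import SelectBySingleFeaturePerformance, SelectByShuffling
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Keep features whose single-feature models reach a minimum roc-auc.
single_feature_selector = SelectBySingleFeaturePerformance(
    estimator=LogisticRegression(max_iter=2000),
    scoring="roc_auc",
    cv=3,
    threshold=0.6,
)
X_single = single_feature_selector.fit_transform(X, y)

# Keep features whose permutation degrades the model's performance.
shuffling_selector = SelectByShuffling(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    scoring="roc_auc",
    cv=3,
)
X_shuffled = shuffling_selector.fit_transform(X, y)
```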
Finally, MLXtend is another open-source Python library; it offers the three classical wrapper methods, forward selection, backward elimination, and exhaustive search, out of the box.
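Below is a minimal sketch of MLXtend's SequentialFeatureSelector; the number of features, the scoring metric, and the cross-validation settings are illustrative. MLXtend also provides an ExhaustiveFeatureSelector for exhaustive search.

```python
# Minimal sketch: wrapper-based selection with MLXtend's SequentialFeatureSelector.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

sfs = SFS(
    RandomForestClassifier(n_estimators=20, random_state=0),
    k_features=10,     # number of features to keep
    forward=True,      # forward selection; set to False for backward elimination
    floating=False,    # True enables the floating variants
    scoring="roc_auc",
    cv=3,
)
sfs = sfs.fit(X, y)
print(sfs.k_feature_names_)
```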
Wrapping up
To wrap up, let’s summarize what we’ve discussed so far. Feature importance refers to the degree of influence each feature has on the output of a machine learning model. It helps us understand why a model makes a prediction. And it also helps data scientists and selection algorithms select the best features.
Feature selection is the process of selecting the most relevant features for a machine learning model. The objective is to train models on limited feature spaces to reduce training and scoring times and improve interpretability (for humans).
We can obtain feature importance and also select features with Python, utilizing the libraries Scikit-learn, Feature-engine, and MLXtend.
Data preprocessing and feature engineering will affect feature importance and, therefore, feature selection. For example, the different encoding methods used to convert categorical variables into numbers will return new features that may be more or less predictive for certain machine learning models. The same is true if we transform variables with math functions or discretize them.
Deriving feature significance from deep learning models is harder. Methods like LIME and SHAP help us understand what drives the model’s decisions in local areas.
Additional resources