How to use Feature Importance with XGBoost


    Understanding the crucial features in your dataset can be highly advantageous when training machine learning models.

    Specifically, in XGBoost, a powerful gradient boosting framework used for developing predictive models, understanding feature importance is vital. Training an XGBoost model with an emphasis on feature importance allows you to enhance the model's performance by focusing on the most impactful features within your dataset.

    In this guide, we will delve deep into the methods, best practices, and interpretations of feature importance in XGBoost. We will explore the built-in feature importance in XGBoost, the permutation-based feature importance, and SHAP (Shapley Additive exPlanations) values, each with their unique approaches and considerations. We'll discuss these methods along with code examples and visualizations to drive a comprehensive understanding of the topic.

    In addition, we will consider the importance of effective feature selection and the role of feature engineering in improving the performance of XGBoost models. These considerations can help mitigate challenges related to overfitting, computational expense, and interpretability of models.

    As we dive into this topic, remember that understanding the feature importance in your model is not just about improving model performance; it's about gaining a deeper understanding of your data and the factors that drive predictions.

    Understanding feature importance in XGBoost

    In predictive modeling, it is essential to identify which predictors, or features, contribute significantly to a model's output. This concept is known as feature importance, a valuable tool that enables us to interpret our models better. For XGBoost, a widely used gradient boosting library, understanding feature importance can be somewhat complex due to the nature of the model's composition.

    In XGBoost, there are several ways to quantify the importance of features within a model. The first method is the built-in feature importance. Its 'gain' variant computes the average gain across all the splits in which a feature is used: the gains of every split on that feature are summed and then divided by the number of those splits. The higher the gain, the more the feature contributes to the model's predictions.

    The second method is the permutation-based feature importance. This technique involves randomly shuffling each feature and measuring the change in the model's performance. The idea here is simple yet effective; if a feature is critical, randomly shuffling it would negatively impact the performance of the model. Conversely, if a feature is not relevant, the model's performance will remain more or less stable despite its random permutation.

    The third method relies on SHAP (Shapley Additive exPlanations) values, a model-agnostic approach hailing from cooperative game theory. In the context of machine learning, this method assigns each feature an importance value for a particular prediction. The importance value is essentially the average marginal contribution of a feature across all possible feature combinations.

    This state-of-the-art approach offers more granular insight compared to other methods, as it gives us the contribution of each feature to every single prediction, rather than just an overall measure of importance.

    SHAP values example bar graph.

    A SHAP value graph showing the feature importance for number of bedrooms, size of the house and the age of the house for the effect on its price. Reducing the size of the house decreased the predicted price by approximately $24,770 compared to the base value, increasing the age of the house decreased the predicted price by approximately $37,120 compared to the base value, whilst the number of bedrooms does not have an impact on the price.

    It's worth noting that these methods come with their own set of considerations. For example, built-in feature importance may overestimate the importance of continuous and high-cardinality categorical variables, permutation-based importance can be computationally expensive, and interpreting SHAP values may require a deeper understanding of the theory behind the method. Additionally, the presence of highly correlated features can affect the results.

    In conclusion, understanding feature importance in XGBoost is a multi-faceted process, requiring careful use and interpretation of different methods. The feature importance metric you choose to use would depend on your specific use case and resources. Yet, regardless of the method employed, understanding feature importance is a powerful tool that can help us decipher our XGBoost models and make more informed decisions based on our predictions.

    Built-in feature importance in XGBoost

    In XGBoost, one of the most straightforward methods of understanding feature importance is by using the built-in feature importance method. This approach is particularly effective for getting a primary idea about which features are playing a crucial role in the model's predictive process.

    In this method, feature importance is calculated as the average gain across all the tree nodes that use a particular feature as a predictor. This is done by summing up the gains of all the splits where the feature is used and dividing by the number of those splits. The gain measures how much a split improves the model's training objective, which is often described loosely as a relative improvement in accuracy. It provides a score which signifies how useful each feature was in the construction of the boosted decision trees within the model.

    Example of feature importance tree.

    The tree starts with the root node. Here, the best feature to split on is determined to be feature X1 with a gain of 0.5. This means that by splitting on X1, we have achieved a relative improvement in accuracy of 0.5. The left node then splits on feature X2 with a gain of 0.3. This indicates that splitting on X2 at this node brought a relative improvement in accuracy of 0.3. Leaf nodes are the terminal nodes of the tree. They don't make any further splits.
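
    To make the difference between the importance types concrete, here is a minimal sketch that tallies them by hand. The first two splits come from the example tree above; the third split (X1 with a gain of 0.1 in a second tree) and the helper name `summarize_splits` are hypothetical, added only so that the average and the total differ.

```python
from collections import defaultdict

# (feature, gain) pairs. The first two are the splits from the example tree;
# the third is a hypothetical split from a second tree, for illustration only.
splits = [("X1", 0.5), ("X2", 0.3), ("X1", 0.1)]


def summarize_splits(splits):
    """Tally split count, summed gain and average gain per feature."""
    weight = defaultdict(int)        # how many times each feature is split on
    total_gain = defaultdict(float)  # sum of the gains of those splits
    for feature, gain in splits:
        weight[feature] += 1
        total_gain[feature] += gain
    # Average gain = summed gain divided by the number of splits
    avg_gain = {f: total_gain[f] / weight[f] for f in weight}
    return dict(weight), dict(total_gain), avg_gain


weight, total_gain, avg_gain = summarize_splits(splits)
for feature in weight:
    print(feature, weight[feature], total_gain[feature], avg_gain[feature])
```

    In this toy tally, X1 is split on twice, so its average gain is lower than its summed gain, while X2 has the same value for both. The three dictionaries correspond in spirit to the 'weight', 'total_gain' and 'gain' importance types.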

    Here's how you can use it in Python:

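    import matplotlib.pyplot as plt
    import xgboost as xgb

    # Assuming that 'model' is a trained XGBoost model (scikit-learn API, e.g. xgb.XGBClassifier)
    # 'weight' counts how often each feature is used to split the data
    feature_important = model.get_booster().get_score(importance_type='weight')
    keys = list(feature_important.keys())
    values = list(feature_important.values())

    # Horizontal bar chart of the importance scores
    plt.barh(keys, values)
    plt.xlabel('Number of splits (weight)')
    plt.show()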

    In the code snippet above, we use the `get_score()` method to get the feature importance, specifying the importance type as 'weight', which is the number of times a feature is used to split the data across all the trees of the model. To get the average gain described earlier instead, pass `importance_type='gain'`. Then, we create a bar plot to visualize the importance of each feature.

    While this built-in method is relatively easy to implement, it's important to note that it has its limitations too. For instance, it might overstate the importance of continuous features or high-cardinality categorical variables. Since these types of features have a broader range of values, they offer more candidate split points, are more likely to be used in splits, and so might appear more important than they actually are. Despite these considerations, the built-in feature importance provides a good starting point for understanding the key drivers of your XGBoost model's predictions.

    Permutation-based feature importance

    Another approach to assessing feature importance in XGBoost models is permutation-based feature importance. Unlike the built-in method, which looks at how the trees were built, this technique measures the decrease in a model's performance when a single feature's values are randomly shuffled across the observations.

    GIF illustrating how permutations work.

    To calculate permutation-based feature importance, each feature column in the validation dataset is permuted, and the model's performance drop is recorded. If the feature is crucial to the model's prediction, randomizing its values will significantly degrade the model's performance, indicating its importance. On the other hand, if a feature isn't crucial, randomizing its values won't significantly affect the model's performance, signifying it as a less important feature.

    Here's how you can calculate permutation-based feature importance in Python:

    from sklearn.inspection import permutation_importance

    # Assuming that 'model' is the trained XGBoost model (a classifier, since we score by accuracy)
    # and 'X_val' and 'y_val' are validation datasets
    results = permutation_importance(model, X_val, y_val, scoring='accuracy')
    importance = results.importances_mean
    # Print the mean drop in accuracy for each feature
    for i, v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i, v))

    In the snippet above, we use the `permutation_importance()` function from scikit-learn's inspection module. We pass the model, the validation data, and specify a scoring method. In this case, we're using accuracy. The function returns the mean importance of each feature. We then print the score of each feature to get a sense of which features are the most important according to this method.

    While this method offers a less biased estimate of feature importance, it comes with its own considerations. Permutation-based feature importance can be computationally expensive, particularly when dealing with high-dimensional datasets. This is because it has to permute each feature column one by one and recompute the model's performance each time.

    Despite this, it provides an insightful way to understand the role of each feature in the model's prediction, especially when we need a more objective measure that isn't influenced by the number of splits or the depth of the trees, unlike the built-in feature importance.

    Remember, the goal here is not to strictly rank features based on their importance, but to understand which features contribute to the model's predictions and, consequently, to improve the interpretability of your XGBoost models.
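
    To show what happens under the hood, here is a hand-rolled sketch of the shuffle-and-rescore loop. It is not scikit-learn's implementation: the synthetic data, the `predict` stand-in for a trained model, and the helper `manual_permutation_importance` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target depends on column 0 only; column 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)


def predict(X):
    """Stand-in for a trained classifier's predict method (hypothetical)."""
    return (X[:, 0] > 0).astype(int)


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


def manual_permutation_importance(predict, X, y, n_repeats=5, rng=rng):
    """Mean drop in accuracy when each column is shuffled in turn."""
    baseline = accuracy(y, predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column, leaving all other columns untouched
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances[col] = np.mean(drops)
    return importances


importances = manual_permutation_importance(predict, X, y)
print(importances)
```

    Shuffling column 0 should cost the stand-in model a large share of its accuracy, while shuffling the noise column should cost nothing. The nested loop also shows where the computational cost comes from: one full re-scoring per feature per repeat.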

    Interpreting feature importance using SHAP values

    While both the built-in and permutation-based feature importance methods offer valuable insights into the role of features in an XGBoost model, they can sometimes fall short in providing a nuanced understanding. This is particularly the case when we need to assess the contribution of each feature to individual predictions, rather than an overall measure of feature importance. For such requirements, a powerful technique known as SHAP (Shapley Additive exPlanations) values comes into play.

    SHAP values take the idea of feature importance a step further. Instead of providing an overall measure of feature importance, they give us a measure of how much each feature in the dataset contributes to the prediction for each individual observation. In other words, for every prediction that the model makes, SHAP values can tell us which features were most responsible for the prediction and how much they contributed. This kind of fine-grained understanding of a model's decisions can be invaluable when it comes to interpreting and explaining the model's outputs.

    The SHAP method draws upon the concept of Shapley values from cooperative game theory. According to this concept, the contribution of each feature to a prediction should be calculated as the average marginal contribution of that feature across all possible feature combinations. The implementation of this concept in the context of feature importance in machine learning is what is known as SHAP.
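
    The "average marginal contribution across all possible feature combinations" can be computed by brute force when there are only a few features. The sketch below does this for a hypothetical three-feature value function; the feature names, the numbers in `toy_value`, and the helper `shapley_values` are illustrative assumptions, and this is not how the SHAP library computes values for tree models.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f when added to this coalition
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_value(known):
    """Hypothetical model output when only the features in `known` are available."""
    value = 100.0                      # base value with no features known
    if "size" in known:
        value += 50.0
    if "age" in known:
        value -= 20.0
    if "size" in known and "age" in known:
        value -= 10.0                  # interaction between size and age
    return value                       # "bedrooms" never changes the output


features = ["size", "age", "bedrooms"]
phi = shapley_values(toy_value, features)
print(phi)
print(sum(phi.values()), toy_value(frozenset(features)) - toy_value(frozenset()))
```

    Two properties are worth noticing in this toy case: the interaction effect is split evenly between the two features involved, and the values add up to the difference between the full prediction and the base value. The enumeration grows exponentially with the number of features, which is why practical tools rely on faster algorithms.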

    Here's how you can calculate and visualize SHAP values for an XGBoost model in Python:

    import shap
    # Assuming that 'model' is the trained XGBoost model and 'X_train' is the training dataset
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    shap.summary_plot(shap_values, X_train)

    In the code above, we use the `TreeExplainer` from the SHAP library to compute SHAP values for the features in the training data. The `summary_plot` function then creates a visualization of these values, highlighting the impact of each feature on the model's output.

    Interpreting a SHAP summary plot can provide valuable insights. Features are sorted by the sum of SHAP value magnitudes across all samples. Each point on the graph represents a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the SHAP value. The color represents the value of the feature from low (blue) to high (red). Overlapping points are jittered in the y-axis direction, so we get a sense of the distribution of the SHAP values per feature. Points to the right of zero pushed the prediction higher, and points to the left of zero pushed it lower.

    Here's an example:

    Example SHAP summary value plot

    From this graph, using Feature_6 to help explain:

    1. There are both positive and negative SHAP values for "Feature_6", which means this feature can either increase or decrease the model's prediction depending on its value for a particular instance.
    2. The colors of the dots suggest that when "Feature_6" has higher values (darker red), it tends to have positive SHAP values, pushing the prediction higher. Conversely, when "Feature_6" has lower values (darker blue), it tends to have negative SHAP values, pulling the prediction lower.
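
    The feature ordering used by the summary plot can be reproduced from the SHAP matrix itself. The sketch below assumes a small, made-up matrix and made-up feature names purely for illustration; with a real model you would use the array returned by the explainer instead. Ranking by the mean of the magnitudes gives the same order as ranking by their sum.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per observation, one column per feature.
shap_matrix = np.array([
    [0.40, -0.10, 0.02],
    [-0.35, 0.20, 0.01],
    [0.50, -0.05, -0.03],
    [-0.45, 0.15, 0.00],
])
feature_names = ["Feature_6", "Feature_2", "Feature_9"]  # illustrative names

# Global importance: average magnitude of each feature's SHAP values
mean_abs = np.abs(shap_matrix).mean(axis=0)

# Sort features from most to least important
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

    Note that the signed values would nearly cancel out for the first column in this made-up matrix, which is why the magnitudes, not the raw values, are averaged.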

    While the SHAP method provides a more detailed and nuanced understanding of feature importance, interpreting SHAP values may require a deeper understanding of the underlying theory, and computing SHAP values can be computationally intensive with high-dimensional data. Still, when used wisely, SHAP values can serve as a powerful tool for interpreting machine learning models, including those built with XGBoost.

    Best Practices for Feature Selection in XGBoost

    Understanding feature importance is crucial when working with machine learning models like XGBoost, but it's not the only aspect to consider. Selecting the right features can have a substantial impact on the model's performance, training time, and interpretability. Here, we will take a closer look at best practices for feature selection when training an XGBoost model.

    Feature selection involves determining the most relevant features to use in model training. It's an essential step in building an effective machine learning model, as it can lead to improved model performance, less overfitting, faster training times, and better understandability of the model.


    When deciding which features to include in your model, keep in mind that XGBoost, like most machine learning algorithms, will not always pick up complex relations between the features on its own. Therefore, manual feature engineering, which involves creating new features based on existing ones or transforming features to make them more useful, is still a vital step in the overall process. Through feature engineering, you can help expose these relations to the model, potentially improving its performance.

    Consider the following practices for feature selection in XGBoost:

    • Check for highly correlated features: If two features are highly correlated, keeping both in the model can lead to overfitting and can make the feature importance results difficult to interpret. In such cases, consider removing one of the correlated features.
    • Perform feature importance analysis: As discussed in the previous sections, use methods like built-in feature importance, permutation-based importance, and SHAP values to understand your data better and to identify the most impactful features.
    • Consider the use of domain knowledge: If you have a deep understanding of the domain your data comes from, leverage this knowledge to select features. A feature that has strong theoretical or empirical relevance to your output variable might be worth including, even if it doesn't show high importance in initial analyses.
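
    Once you have importance scores, you can also use them to select features automatically. Here's how you can do it in Python:

    from sklearn.feature_selection import SelectFromModel

    # Assuming that 'model' is the trained XGBoost model and 'X_train' is the training dataset
    # prefit=True tells scikit-learn that the model has already been fitted
    selection = SelectFromModel(model, threshold=0.15, prefit=True)
    selected_dataset = selection.transform(X_train)
    print("Total features: ", X_train.shape[1])
    print("Selected features: ", selected_dataset.shape[1])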

    In the Python script above, we use SelectFromModel from scikit-learn’s feature_selection module to select the features whose importance is at or above the threshold value. In this case, the threshold is set to 0.15, meaning we keep features that have a feature importance value of 0.15 or more.
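
    The correlation check from the list of practices can be sketched in a few lines as well. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic columns below are all assumptions chosen for illustration; a sensible threshold depends on your data.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic example: two near-duplicate size columns and an unrelated age column
rng = np.random.default_rng(0)
size_sqft = rng.normal(1500, 300, size=200)
size_sqm = size_sqft * 0.0929 + rng.normal(0, 1, size=200)
age = rng.normal(30, 10, size=200)

X = np.column_stack([size_sqft, size_sqm, age])
pairs = highly_correlated_pairs(X, ["size_sqft", "size_sqm", "age"])
print(pairs)
```

    Each flagged pair is a candidate for dropping one of its two members. Which one to keep is a judgment call, and domain knowledge usually helps.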

    In conclusion, feature selection is an essential step when creating an XGBoost model. By selecting relevant features, you can ensure that your model operates efficiently and effectively, resulting in more accurate results and simpler, more interpretable models. As always, remember to evaluate your model after feature selection to ensure that its performance is satisfactory.

    The Role of Feature Engineering in XGBoost

    While understanding and selecting the right features is crucial for training your XGBoost model, the process doesn't end there. Effective feature engineering can significantly enhance the performance of your model. So, what is feature engineering, and why does it matter?

    Feature engineering is the process of transforming raw data into a format that is better understood by machine learning algorithms. It involves extracting, creating or modifying features to improve a model's predictive performance, facilitate its interpretability, and reduce computational or data needs.

    Let's delve deeper into how feature engineering can significantly influence XGBoost models:

    • Creating Interaction Features: XGBoost, while being a highly powerful algorithm, may not efficiently capture complex interactions between features. Using domain expertise or automated techniques, you can create interaction features that signify relationships between two or more features, potentially boosting your model's performance.
    • Handling Missing Values: Missing data can pose challenges during the training of a model. XGBoost has an inherent method to handle missing values; however, it might not always be the best strategy. Therefore, developing well-informed strategies to impute missing values can result in a more robust model.
    • High-Cardinality Categorical Features: Historically, XGBoost did not natively handle categorical data, and the native support in newer releases has to be enabled explicitly. High-cardinality categorical features (those with a large number of unique values) can be particularly taxing to encode and include in the model. Using feature engineering techniques like binary encoding, hashing, or target encoding can make these features more manageable and useful.
    • Continuous Feature Transformations: Transformations like log, square, or square root can sometimes help handle outliers or skewness in continuous features, or expose linear relationships masked by scale.

    Here is an example of how you could create an interaction feature and handle missing values in Python:

    import pandas as pd

    # Assuming 'df' is your pandas DataFrame and 'f1', 'f2', 'f3' are features in your dataset
    # Create interaction feature
    df['f1_f2_interaction'] = df['f1'] * df['f2']
    # Handle missing values by replacing them with the column mean
    # (assign the result back rather than using inplace=True on a column selection)
    df['f3'] = df['f3'].fillna(df['f3'].mean())

    In the code snippet above, we first create a new feature `f1_f2_interaction` which is the product of `f1` and `f2`. Such interaction terms can often help the model uncover complex relationships between features. Secondly, we handle missing values in `f3` by replacing them with the mean of `f3`. This is a simple and commonly used imputation strategy, but it's not always appropriate and other methods should be considered depending on the nature of your data.
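
    Target encoding, mentioned above for high-cardinality categorical features, can be sketched without any libraries. This is a deliberately simplified version: the helper names, the toy `train_city` and `train_y` lists, and the fallback to the global mean for unseen categories are assumptions for illustration. The mapping is learned from the training data only, so that validation targets do not leak into the feature.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category from training data only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {c: sums[c] / counts[c] for c in sums}
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Replace each category by its learned mean; unseen ones get the global mean."""
    return [mapping.get(c, global_mean) for c in categories]


# Toy training data (hypothetical): a categorical column and a binary target
train_city = ["A", "A", "B", "B", "B", "C"]
train_y = [1, 0, 1, 1, 1, 0]

mapping, global_mean = fit_target_encoding(train_city, train_y)

# "D" never appeared in training, so it falls back to the global mean
encoded = apply_target_encoding(["A", "C", "D"], mapping, global_mean)
print(encoded)
```

    Categories with very few rows get noisy estimates under this plain version, so in practice some form of smoothing or cross-fitting is often worth considering.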

    Take note, feature engineering is more of an art than a science, requiring both creativity and a solid understanding of the data and the domain from which it comes. It may not always lead to huge boosts in performance, but when done right, it can help squeeze out that extra bit of accuracy from your model, while making it more interpretable and faster to train.

    Bear in mind that the newly engineered features, like any other features, should be assessed for their relevance and importance to the model. Therefore, after carrying out feature engineering, you should again consider applying the methods of understanding feature importance, as discussed in the previous sections.

    In conclusion, understanding feature importance is not a standalone task, but rather a part of a larger cycle of model development that also includes feature selection and feature engineering. Each of these steps feeds into the other, and a comprehensive grasp of all three is necessary to develop effective and interpretable XGBoost models.


    In this guide, we have explored the importance of understanding feature importance in XGBoost models and how it can result in more accurate and efficient models. We delved into the built-in feature importance provided by XGBoost, permutation-based feature importance, and the powerful SHAP values, each with their unique approaches and considerations.

    Moreover, we have discussed the role of effective feature selection and feature engineering in enhancing the overall performance of your XGBoost models. From checking for high correlation among features to creating interaction features, we have unraveled the best practices that can help you develop more robust and interpretable models.

    Undoubtedly, understanding feature importance is not just about improving model performance; it's about gaining a deeper understanding of your data and the factors that drive predictions. So, as you continue your journey in machine learning, make sure to leverage these approaches and practices to better understand your models and make more informed decisions based on your predictions. Remember, the more you understand your model, the more confident you can be in your predictions!

    Richard Lawrence

    About Richard Lawrence

    Constantly looking to evolve and learn, I have studied in areas as diverse as Philosophy, International Marketing and Data Science. I've been within the tech space, including SEO and development, since 2008.
    Copyright © 2024 evolvingDev. All rights reserved.