Introduction to shapr

An R-package for explaining predictive models using dependence-aware Shapley values

Main developers: Martin Jullum, Nikolai Sellereite & Annabelle Redelmeier

About the package

Installation

Via GitHub

  • Standard version:
In [1]:
#remotes::install_github("NorskRegnesentral/shapr") # Remove initial # to run
  • Ctree version:
In [2]:
#remotes::install_github("NorskRegnesentral/shapr",ref = "ctree") # Remove initial # to run

From CRAN

  • Release version (available within a week or two)
In [3]:
#install.packages("shapr") # Remove initial # to run

How to use the package

  • Prepare the explanation with the shapr function
    • Input
      • Training data
      • Model to explain
      • Additional parameters related to the KernelSHAP approximation
  • Compute the Shapley values with the explain function
    • Input
      • The output from shapr
      • Data whose predictions ought to be explained (test data)
      • String specifying the approach to estimate the conditional expectations
      • The value to use as reference $\phi_0$ (typically the mean of the responses or predictions on the training data)
      • Additional specifications for the conditional expectation approach
  • Either
    • Print the table of Shapley values
    • Plot the Shapley values with plot.shapr
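
A minimal sketch of this workflow (the objects x_train, y_train, model and x_test are placeholders here; the full worked example follows below):

library(shapr)

# Prepare the explanation: training data + the model to explain
explainer <- shapr(x = x_train, model = model)

# Reference value phi_0: here the mean of the training responses
p0 <- mean(y_train)

# Compute Shapley values for the test observations
explanation <- explain(
  x = x_test,
  approach = "empirical",  # or e.g. "gaussian" or "ctree"
  explainer = explainer,
  prediction_zero = p0
)

explanation$dt     # table of Shapley values
plot(explanation)  # bar plots via plot.shapr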

Example explaining a risk model for loan customers

  • Data set: Home Equity Line of Credit (HELOC) applications made by homeowners (retrieved from https://community.fico.com/s/explainable-machine-learning-challenge)
  • Response: Binary

    • 1 = Bad (>90 days late with a payment)
    • 0 = Good (otherwise)
  • Examples:

    • First explain a model with only numerical features
    • Then explain a model with both numerical and categorical features
  • Model: XGBoost (a tree-based gradient boosting machine), plus an additional logistic regression model in the second example, for illustration

Data and model preparations

Loading packages and data

In [4]:
#### Package usage ####
library(shapr)
library(xgboost)
library(data.table)
library(ggplot2)

data <- data.table::fread(file = "FICO_HELOC_dataset_v1.csv")

Prepare modelling data

In [5]:
# Define response
data[,y:=(RiskPerformance=="Bad")*1]
data[,RiskPerformance:=NULL]

# NAs are coded as negative numbers -- recode them to NA
data[data<0]=NA

# Re-code two factor variables
data[,MaxDelq2PublicRecLast12M:=as.factor(MaxDelq2PublicRecLast12M)]
data[,MaxDelqEver:=as.factor(MaxDelqEver)]

# Define response and variables
y_var <- "y"

x_var_numeric = c("ExternalRiskEstimate",
                  "PercentTradesNeverDelq",
                  "NetFractionRevolvingBurden",
                  "NumTrades60Ever2DerogPubRec",
                  "MSinceMostRecentInqexcl7days")

x_var_factor = c("MaxDelq2PublicRecLast12M", "MaxDelqEver")

x_var_mixed = c(x_var_numeric,x_var_factor)

xy_var = c(x_var_mixed,y_var)
xy_data = data[,..xy_var]
xy_data = xy_data[complete.cases(xy_data)] # Remove observations with one or more NA in our features
In [6]:
# Split data in training and test set
set.seed(123)
test_ind = sort(sample(1:nrow(xy_data),5))
train_ind = (1:nrow(xy_data))[-test_ind]

x_train <- xy_data[train_ind,..x_var_numeric]
y_train <- unlist(xy_data[train_ind,..y_var])
x_test <- xy_data[test_ind,..x_var_numeric]

Fit XGBoost model to numerical features

In [7]:
# Fitting a basic xgboost model to the training data
params <- list(eta = 0.3,
               max_depth = 4,
               objective = "binary:logistic",
               eval_metric = "auc",
               tree_method = "hist")

model <- xgboost::xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  params = params,
  verbose = TRUE)
[1]	train-auc:0.783454 
[2]	train-auc:0.788182 
[3]	train-auc:0.791883 
[4]	train-auc:0.794030 
[5]	train-auc:0.795987 
[6]	train-auc:0.798592 
[7]	train-auc:0.800885 
[8]	train-auc:0.802625 
[9]	train-auc:0.804006 
[10]	train-auc:0.805129 
[11]	train-auc:0.805891 
[12]	train-auc:0.807342 
[13]	train-auc:0.808248 
[14]	train-auc:0.808857 
[15]	train-auc:0.810170 
[16]	train-auc:0.811097 
[17]	train-auc:0.811691 
[18]	train-auc:0.812617 
[19]	train-auc:0.813019 
[20]	train-auc:0.813348 
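
Before explaining the model, it is useful to look at the predictions we are about to explain. A quick sketch using the fitted model and the test set from above:

# Predicted probability of y = 1 ("Bad") for the 5 test observations
pred_test <- predict(model, as.matrix(x_test))
pred_test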

The actual explanation

Prepare the explanation

In [8]:
# Prepare the data/model for explanation
explainer <- shapr(x = x_train,
                   model = model)

# Define the reference prediction for explanation
p <- mean(y_train)

Explaining the predictions

Using the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)

In [9]:
explanation_empirical <- explain(
  x = x_test,
  approach = "empirical",
  explainer = explainer,
  prediction_zero = p
)
In [10]:
# Print
explanation_empirical$dt

# Plot including phi_0
#plot(explanation_empirical)

# Plot excluding phi_0
#plot(explanation_empirical, plot_phi0 = FALSE)
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.14009418           -0.058493999            -0.091652522                -0.02625920                  -0.10964711
0.5139922   0.03821041            0.017113909            -0.007396677                 0.06403613                   0.06174659
0.5139921  -0.06084481           -0.063216182            -0.035098324                -0.02634950                  -0.11334647
0.5139921   0.07013682           -0.001679264            -0.128021062                -0.02545825                  -0.12560994
0.5139922  -0.04863755           -0.063395918            -0.001233106                 0.08271528                   0.09613420
[Plot output: bar plots of the Shapley values per test observation, "With phi0" and "Without phi0"]
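
The Shapley values satisfy the efficiency property: phi_0 (the none column) plus the feature contributions sum to the model prediction for each observation. A quick sanity check (a sketch, assuming the dt layout printed above, i.e. the none column followed by one column per feature):

# Row sums of dt should match the predictions for the test observations
rowSums(explanation_empirical$dt)
predict(model, as.matrix(x_test))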

Same empirical explanation method, but increasing the bandwidth

In [11]:
# Increasing the bandwidth parameter
explanation_empirical_02 <- explain(
  x = x_test,
  approach = "empirical",
  type = "fixed_sigma",
  fixed_sigma_vec = 0.2,
  explainer = explainer,
  prediction_zero = p
)

# Print
explanation_empirical_02$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.14600865           -0.056284615            -0.09453238                 -0.02433007                  -0.10499130
0.5139922   0.04473797            0.016100449            -0.01607504                  0.05877593                   0.07017106
0.5139921  -0.06246399           -0.064910776            -0.03495893                 -0.02802230                  -0.10849928
0.5139921   0.07192926           -0.005466616            -0.12720089                 -0.02647261                  -0.12342085
0.5139922  -0.04110378           -0.058163279            -0.01012229                  0.08391672                   0.09105552
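
The bandwidth sigma scales the kernel that weights training observations by their distance to the conditioning values, so a larger value smooths over more observations; here the two settings give similar but not identical Shapley values. A small sketch comparing them:

# Mean absolute difference per feature between sigma = 0.1 and sigma = 0.2
colMeans(abs(explanation_empirical$dt - explanation_empirical_02$dt))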

Empirical approach assuming independence

In [12]:
# Assuming independence
explanation_empirical_indep <- explain(
  x = x_test,
  approach = "empirical",
  type = "independence",
  explainer = explainer,
  prediction_zero = p
)

# Print
explanation_empirical_indep$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139922  -0.155715216          -0.05812186             -0.1088501095               -0.0087369445                -0.09472289
0.5139922   0.083907371           0.02654383              0.0030868207               -0.0078409022                 0.06801325
0.5139922  -0.100239073          -0.05643740             -0.0369913505                0.0013489466                -0.10653640
0.5139922  -0.008207541          -0.02025552             -0.0744908832                0.0004865818                -0.10816433
0.5139922   0.003227606          -0.02915472              0.0007318746                0.0117219363                 0.07905620

Gaussian approach

In [13]:
# Using the Gaussian approach
explanation_gaussian <- explain(
  x = x_test,
  approach = "gaussian",
  explainer = explainer,
  prediction_zero = p
)

# Print
#print(explanation_gaussian$dt)
explanation_gaussian$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.14974422           -0.07185052             -0.09352365                 -0.01767137                  -0.09335726
0.5139922   0.04799906           -0.01081243             -0.02393180                  0.04697593                   0.11347961
0.5139922  -0.07777286           -0.08883438             -0.05089606                 -0.01794015                  -0.06341184
0.5139921   0.04081801           -0.04915794             -0.10556967                 -0.02393681                  -0.07278530
0.5139922  -0.01045328           -0.08859978             -0.01636733                  0.04919641                   0.13180688

Ctree approach

In [14]:
# Using the ctree approach
explanation_ctree <- explain(
  x = x_test,
  approach = "ctree",
  explainer = explainer,
  prediction_zero = p
)

# Print
explanation_ctree$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.13812920           -0.05232592             -0.091897794                -0.02504555                  -0.11874855
0.5139922   0.05345471            0.01525332             -0.034518008                 0.06594320                   0.07357715
0.5139921  -0.05566900           -0.06558527             -0.034769506                -0.02186155                  -0.12096997
0.5139921   0.06367172           -0.01604074             -0.123537849                -0.02074798                  -0.11397684
0.5139922  -0.05547980           -0.05142160             -0.001148702                 0.08305760                   0.09057540
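
The approaches give broadly similar explanations for these five test observations. A compact way to compare them side by side (a sketch using the explanation objects computed above):

# Shapley values for the first test observation under each approach
rbind(
  empirical    = unlist(explanation_empirical$dt[1]),
  empirical_02 = unlist(explanation_empirical_02$dt[1]),
  independence = unlist(explanation_empirical_indep$dt[1]),
  gaussian     = unlist(explanation_gaussian$dt[1]),
  ctree        = unlist(explanation_ctree$dt[1])
)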

Turning to the mixed data setting (numerical + categorical features)

Define the data

In [15]:
x_train_mixed <- xy_data[train_ind,..x_var_mixed]
x_test_mixed <- xy_data[test_ind,..x_var_mixed]

For simplicity, we first do this for a logistic regression model

In [16]:
model_glm_mixed <- glm(y ~ ., data = cbind(x_train_mixed, y = as.factor(y_train)), family = "binomial")

explainer_glm_cat <- shapr(x = x_train_mixed,
                          model = model_glm_mixed)

explanation_glm_cat_ctree <- explain(
  x = x_test_mixed,
  approach = "ctree",
  explainer = explainer_glm_cat,
  prediction_zero = p)

explanation_glm_cat_ctree$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days  MaxDelq2PublicRecLast12M  MaxDelqEver
0.5139921  -0.0952231017         -0.026456084            -0.074574015                -0.01317814                  -0.161355101                  -0.027126068              -0.02636367
0.5139922   0.0545974360         -0.003901146            -0.055693930                 0.02807622                   0.036045539                   0.003320283              -0.01404892
0.5139921  -0.0008641486         -0.024777185            -0.022430062                -0.01309718                  -0.003369045                  -0.032145676              -0.02933691
0.5139922  -0.0044661032         -0.030944132            -0.133820591                -0.01382102                  -0.028310229                   0.083909514               0.02429513
0.5139922  -0.0453148308         -0.060816445            -0.008726553                 0.04049775                   0.036376869                  -0.013716646               0.03263165

Mixed data with XGBoost model

  • Since XGBoost does not handle categorical data directly, we need to transform the data to numerical using one-hot encoding
  • We want to explain the original features and not the one-hot encoded features
  • We therefore need to pass the transformation function to our explainer, so that the XGBoost model can be called with the original data
  • We have created special functionality to handle this:
    • Call our function shapr::make_dummies to create the transformation function
    • Get transformed data using our function shapr::apply_dummies
In [17]:
# Making the transformation function (based on both training and test data, so that all factor levels are included)
dummylist <- make_dummies(data = rbind(x_train_mixed, x_test_mixed))

# Applying the transformation function
x_train_mixed_dummy <- apply_dummies(obj = dummylist, newdata = x_train_mixed)

# Fitting the XGBoost model
model_mixed <- xgboost(
  data = as.matrix(x_train_mixed_dummy),
  label = y_train,
  nrounds = 20,
  params = params,
  verbose = FALSE)

# Attach the transformation function to the model object
model_mixed$dummylist <- dummylist

Explain the model just like before (categorical features are supported by the ctree approach only!)

In [18]:
explainer_mixed <- shapr(x_train_mixed,
                         model_mixed)

explanation_ctree_mixed <- explain(
  x_test_mixed,
  approach = "ctree",
  explainer = explainer_mixed,
  prediction_zero = p
)

explanation_ctree_mixed$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days  MaxDelq2PublicRecLast12M  MaxDelqEver
0.5139921  -0.11471762           -0.02929698             -0.09727294                 -0.01447551                  -0.11542969                   -0.029436632              -0.02907514
0.5139922   0.06195969            0.02387927             -0.04003806                  0.03772227                   0.07045391                    0.003979827               0.01671679
0.5139921  -0.04180230           -0.02707347             -0.03570163                 -0.01480906                  -0.11964709                   -0.031520254              -0.03040424
0.5139922  -0.02082199           -0.05299135             -0.11596528                 -0.01593434                  -0.11884721                    0.082600068               0.02366128
0.5139922  -0.05694845           -0.02308505              0.01728557                  0.05923145                   0.09172026                   -0.014275603               0.08562961
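
A simple way to summarize such a table is to pull out the most influential feature for each observation. A sketch, assuming the dt layout used throughout (phi_0 in the none column, then one column per feature):

# Feature with the largest absolute Shapley value per test observation
shap_mat <- as.matrix(explanation_ctree_mixed$dt[, !"none"])
colnames(shap_mat)[apply(abs(shap_mat), 1, which.max)]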