Introduction to shapr

An R-package for explaining predictive models using dependence-aware Shapley values

Main developers: Martin Jullum, Nikolai Sellereite & Annabelle Redelmeier

About the package

Installation

Via GitHub

  • Standard version:
In [1]:
#remotes::install_github("NorskRegnesentral/shapr") # Remove initial # to run
  • Ctree version:
In [2]:
#remotes::install_github("NorskRegnesentral/shapr",ref = "ctree") # Remove initial # to run

From CRAN

  • Release version (available within a week or two)
In [3]:
#install.packages("shapr") # Remove initial # to run

How to use the package

  • Prepare the explanation with the shapr function
    • Input
      • Training data
      • Model to explain
      • Additional parameters related to the KernelSHAP approximation
  • Compute the Shapley values with the explain function
    • Input
      • The output from shapr
      • Data whose predictions ought to be explained (test data)
      • String specifying the approach to estimate the conditional expectations
      • The value to use as reference $\phi_0$ (typically the mean of the responses or predictions on the training data)
      • Additional specifications for the conditional expectation approach
  • Either
    • Print the table of Shapley values
    • Plot the Shapley values with plot.shapr
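
A minimal sketch of this workflow (the objects x_train, y_train, model and x_test are placeholders here; the full worked example follows below):

library(shapr)

# Prepare the explanation: training data + the model to explain
explainer <- shapr(x = x_train, model = model)

# Reference value phi_0: here the mean of the training responses
p0 <- mean(y_train)

# Compute Shapley values for the test observations
explanation <- explain(
  x = x_test,
  approach = "empirical",  # or e.g. "gaussian" or "ctree"
  explainer = explainer,
  prediction_zero = p0
)

explanation$dt     # table of Shapley values
plot(explanation)  # bar plots via plot.shapr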

Example explaining a risk model for loan customers

  • Data set: Home Equity Line of Credit (HELOC) applications made by homeowners (retrieved from https://community.fico.com/s/explainable-machine-learning-challenge)
  • Response: Binary

    • 1 = Bad (>90 days late with a payment)
    • 0 = Good (otherwise)
  • Examples:

    • First explain a model with only numerical features
    • Then explain a model with both numerical and categorical features
  • Model: XGBoost (a tree-based gradient boosting machine), plus an additional logistic regression model in the second example, for illustration

Data and model preparations

Loading packages and data

In [4]:
#### Package usage ####
library(shapr)
library(xgboost)
library(data.table)
library(ggplot2)

data <- data.table::fread(file = "FICO_HELOC_dataset_v1.csv")

Prepare modelling data

In [5]:
# Define response
data[,y:=(RiskPerformance=="Bad")*1]
data[,RiskPerformance:=NULL]

# NAs are coded as negative numbers -- recode them to NA
data[data<0]=NA

# Re-code two factor variables
data[,MaxDelq2PublicRecLast12M:=as.factor(MaxDelq2PublicRecLast12M)]
data[,MaxDelqEver:=as.factor(MaxDelqEver)]

# Define response and variables
y_var <- "y"

x_var_numeric = c("ExternalRiskEstimate",
                  "PercentTradesNeverDelq",
                  "NetFractionRevolvingBurden",
                  "NumTrades60Ever2DerogPubRec",
                  "MSinceMostRecentInqexcl7days")

x_var_factor = c("MaxDelq2PublicRecLast12M", "MaxDelqEver")

x_var_mixed = c(x_var_numeric,x_var_factor)

xy_var = c(x_var_mixed,y_var)
xy_data = data[,..xy_var]
xy_data = xy_data[complete.cases(xy_data)] # Remove observations with one or more NA in our features
In [6]:
# Split data in training and test set
set.seed(123)
test_ind = sort(sample(1:nrow(xy_data),5))
train_ind = (1:nrow(xy_data))[-test_ind]

x_train <- xy_data[train_ind,..x_var_numeric]
y_train <- unlist(xy_data[train_ind,..y_var])
x_test <- xy_data[test_ind,..x_var_numeric]

Fit XGBoost model to numerical features

In [7]:
# Fitting a basic xgboost model to the training data
params <- list(eta = 0.3,
               max_depth = 4,
               objective = "binary:logistic",
               eval_metric = "auc",
               tree_method = "hist")

model <- xgboost::xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  params = params,
  verbose = TRUE)
[1]	train-auc:0.783454 
[2]	train-auc:0.788182 
[3]	train-auc:0.791883 
[4]	train-auc:0.794030 
[5]	train-auc:0.795987 
[6]	train-auc:0.798592 
[7]	train-auc:0.800885 
[8]	train-auc:0.802625 
[9]	train-auc:0.804006 
[10]	train-auc:0.805129 
[11]	train-auc:0.805891 
[12]	train-auc:0.807342 
[13]	train-auc:0.808248 
[14]	train-auc:0.808857 
[15]	train-auc:0.810170 
[16]	train-auc:0.811097 
[17]	train-auc:0.811691 
[18]	train-auc:0.812617 
[19]	train-auc:0.813019 
[20]	train-auc:0.813348 
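
Before explaining the model, it is useful to look at the predictions we are about to explain. A quick sketch using the fitted model and the test set from above:

# Predicted probability of y = 1 ("Bad") for the 5 test observations
pred_test <- predict(model, as.matrix(x_test))
pred_test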

The actual explanation

Prepare the explanation

In [8]:
# Prepare the data/model for explanation
explainer <- shapr(x = x_train,
                   model = model)

# Define the reference prediction for explanation
p <- mean(y_train)

Explaining the predictions

Using the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)

In [9]:
explanation_empirical <- explain(
  x = x_test,
  approach = "empirical",
  explainer = explainer,
  prediction_zero = p
)
In [10]:
# Print
explanation_empirical$dt

# Plot including phi_0
#plot(explanation_empirical)

# Plot excluding phi_0
#plot(explanation_empirical, plot_phi0 = FALSE)
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.14009418           -0.058493999            -0.091652522                -0.02625920                  -0.10964711
0.5139922   0.03821041            0.017113909            -0.007396677                 0.06403613                   0.06174659
0.5139921  -0.06084481           -0.063216182            -0.035098324                -0.02634950                  -0.11334647
0.5139921   0.07013682           -0.001679264            -0.128021062                -0.02545825                  -0.12560994
0.5139922  -0.04863755           -0.063395918            -0.001233106                 0.08271528                   0.09613420
[Plot output: bar plots of the Shapley values per test observation, "With phi0" and "Without phi0"]
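
The Shapley values satisfy the efficiency property: phi_0 (the none column) plus the feature contributions sum to the model prediction for each observation. A quick sanity check (a sketch, assuming the dt layout printed above, i.e. the none column followed by one column per feature):

# Row sums of dt should match the predictions for the test observations
rowSums(explanation_empirical$dt)
predict(model, as.matrix(x_test))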

Same empirical explanation method, but increasing the bandwidth

In [11]:
# Increasing the bandwidth parameter
explanation_empirical_02 <- explain(
  x = x_test,
  approach = "empirical",
  type = "fixed_sigma",
  fixed_sigma_vec = 0.2,
  explainer = explainer,
  prediction_zero = p
)

# Print
explanation_empirical_02$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.14600865           -0.056284615            -0.09453238                 -0.02433007                  -0.10499130
0.5139922   0.04473797            0.016100449            -0.01607504                  0.05877593                   0.07017106
0.5139921  -0.06246399           -0.064910776            -0.03495893                 -0.02802230                  -0.10849928
0.5139921   0.07192926           -0.005466616            -0.12720089                 -0.02647261                  -0.12342085
0.5139922  -0.04110378           -0.058163279            -0.01012229                  0.08391672                   0.09105552
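
The bandwidth sigma scales the kernel that weights training observations by their distance to the conditioning values, so a larger value smooths over more observations; here the two settings give similar but not identical Shapley values. A small sketch comparing them:

# Mean absolute difference per feature between sigma = 0.1 and sigma = 0.2
colMeans(abs(explanation_empirical$dt - explanation_empirical_02$dt))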

Empirical approach assuming independence

In [12]:
# Assuming independence
explanation_empirical_indep <- explain(
  x = x_test,
  approach = "empirical",
  type = "independence",
  explainer = explainer,
  prediction_zero = p
)

# Print
explanation_empirical_indep$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139922  -0.155715216          -0.05812186             -0.1088501095               -0.0087369445                -0.09472289
0.5139922   0.083907371           0.02654383              0.0030868207               -0.0078409022                 0.06801325
0.5139922  -0.100239073          -0.05643740             -0.0369913505                0.0013489466                -0.10653640
0.5139922  -0.008207541          -0.02025552             -0.0744908832                0.0004865818                -0.10816433
0.5139922   0.003227606          -0.02915472              0.0007318746                0.0117219363                 0.07905620

Gaussian approach

In [13]:
# Using the Gaussian approach
explanation_gaussian <- explain(
  x = x_test,
  approach = "gaussian",
  explainer = explainer,
  prediction_zero = p
)

# Print
#print(explanation_gaussian$dt)
explanation_gaussian$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.14974422           -0.07185052             -0.09352365                 -0.01767137                  -0.09335726
0.5139922   0.04799906           -0.01081243             -0.02393180                  0.04697593                   0.11347961
0.5139922  -0.07777286           -0.08883438             -0.05089606                 -0.01794015                  -0.06341184
0.5139921   0.04081801           -0.04915794             -0.10556967                 -0.02393681                  -0.07278530
0.5139922  -0.01045328           -0.08859978             -0.01636733                  0.04919641                   0.13180688

Ctree approach

In [14]:
# Using the ctree approach
explanation_ctree <- explain(
  x = x_test,
  approach = "ctree",
  explainer = explainer,
  prediction_zero = p
)

# Print
explanation_ctree$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days
0.5139921  -0.13812920           -0.05232592             -0.091897794                -0.02504555                  -0.11874855
0.5139922   0.05345471            0.01525332             -0.034518008                 0.06594320                   0.07357715
0.5139921  -0.05566900           -0.06558527             -0.034769506                -0.02186155                  -0.12096997
0.5139921   0.06367172           -0.01604074             -0.123537849                -0.02074798                  -0.11397684
0.5139922  -0.05547980           -0.05142160             -0.001148702                 0.08305760                   0.09057540
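
The approaches give broadly similar explanations for these five test observations. A compact way to compare them side by side (a sketch using the explanation objects computed above):

# Shapley values for the first test observation under each approach
rbind(
  empirical    = unlist(explanation_empirical$dt[1]),
  empirical_02 = unlist(explanation_empirical_02$dt[1]),
  independence = unlist(explanation_empirical_indep$dt[1]),
  gaussian     = unlist(explanation_gaussian$dt[1]),
  ctree        = unlist(explanation_ctree$dt[1])
)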

Turning to the mixed data setting (numerical + categorical features)

Define the data

In [15]:
x_train_mixed <- xy_data[train_ind,..x_var_mixed]
x_test_mixed <- xy_data[test_ind,..x_var_mixed]

For simplicity, we first do this for a logistic regression model

In [16]:
model_glm_mixed <- glm(y ~ ., data = cbind(x_train_mixed, y = as.factor(y_train)), family = "binomial")

explainer_glm_cat <- shapr(x = x_train_mixed,
                          model = model_glm_mixed)

explanation_glm_cat_ctree <- explain(
  x = x_test_mixed,
  approach = "ctree",
  explainer = explainer_glm_cat,
  prediction_zero = p)

explanation_glm_cat_ctree$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days  MaxDelq2PublicRecLast12M  MaxDelqEver
0.5139921  -0.0952231017         -0.026456084            -0.074574015                -0.01317814                  -0.161355101                  -0.027126068              -0.02636367
0.5139922   0.0545974360         -0.003901146            -0.055693930                 0.02807622                   0.036045539                   0.003320283              -0.01404892
0.5139921  -0.0008641486         -0.024777185            -0.022430062                -0.01309718                  -0.003369045                  -0.032145676              -0.02933691
0.5139922  -0.0044661032         -0.030944132            -0.133820591                -0.01382102                  -0.028310229                   0.083909514               0.02429513
0.5139922  -0.0453148308         -0.060816445            -0.008726553                 0.04049775                   0.036376869                  -0.013716646               0.03263165

Mixed data with XGBoost model

  • Since XGBoost does not handle categorical data directly, we need to transform the data to numerical using one-hot encoding
  • We want to explain the original features and not the one-hot encoded features
  • We therefore need to pass the transformation function to our explainer, so that the XGBoost model can be called with the original data
  • We have created special functionality to handle this:
    • Call our function shapr::make_dummies to create the transformation function
    • Get transformed data using our function shapr::apply_dummies
In [17]:
# Making the transformation function (based on both training and test data, so that all factor levels are included)
dummylist <- make_dummies(data = rbind(x_train_mixed, x_test_mixed))

# Applying the transformation function
x_train_mixed_dummy <- apply_dummies(obj = dummylist, newdata = x_train_mixed)

# Fitting the XGBoost model
model_mixed <- xgboost(
  data = as.matrix(x_train_mixed_dummy),
  label = y_train,
  nrounds = 20,
  params = params,
  verbose = FALSE)

# Attach the transformation function to the model object
model_mixed$dummylist <- dummylist

Explain the model just like before (categorical features are supported by the ctree approach only!)

In [18]:
explainer_mixed <- shapr(x_train_mixed,
                         model_mixed)

explanation_ctree_mixed <- explain(
  x_test_mixed,
  approach = "ctree",
  explainer = explainer_mixed,
  prediction_zero = p
)

explanation_ctree_mixed$dt
none       ExternalRiskEstimate  PercentTradesNeverDelq  NetFractionRevolvingBurden  NumTrades60Ever2DerogPubRec  MSinceMostRecentInqexcl7days  MaxDelq2PublicRecLast12M  MaxDelqEver
0.5139921  -0.11471762           -0.02929698             -0.09727294                 -0.01447551                  -0.11542969                   -0.029436632              -0.02907514
0.5139922   0.06195969            0.02387927             -0.04003806                  0.03772227                   0.07045391                    0.003979827               0.01671679
0.5139921  -0.04180230           -0.02707347             -0.03570163                 -0.01480906                  -0.11964709                   -0.031520254              -0.03040424
0.5139922  -0.02082199           -0.05299135             -0.11596528                 -0.01593434                  -0.11884721                    0.082600068               0.02366128
0.5139922  -0.05694845           -0.02308505              0.01728557                  0.05923145                   0.09172026                   -0.014275603               0.08562961
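
A simple way to summarize such a table is to pull out the most influential feature for each observation. A sketch, assuming the dt layout used throughout (phi_0 in the none column, then one column per feature):

# Feature with the largest absolute Shapley value per test observation
shap_mat <- as.matrix(explanation_ctree_mixed$dt[, !"none"])
colnames(shap_mat)[apply(abs(shap_mat), 1, which.max)]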