# Hands-On Machine Learning Interpretability: Predicting the World Cup Man of the Match

https://zhuanlan.zhihu.com/p/71803324

A previous article introduced the basic concepts of machine learning interpretability. Today we will see how to use these interpretability tools to analyze a real model.

https://github.com/gangtao/ml-Interpretability

```python
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Load the match statistics (the FIFA 2018 Statistics dataset from Kaggle)
data = pd.read_csv('FIFA 2018 Statistics.csv')
data.head()
```

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Target: whether the team won "Man of the Match"
y = data['Man of the Match']
# Use the numeric (int64) columns as features
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model_1 = RandomForestClassifier(random_state=0).fit(train_X, train_y)
my_model_2 = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
```

```python
from sklearn import tree
import graphviz

# Visualize the trained decision tree
tree_graph = tree.export_graphviz(my_model_2, out_file=None, feature_names=feature_names)
graphviz.Source(tree_graph)
```

### Permutation Importance

Permutation importance is a model-agnostic way to measure feature importance. The basic algorithm is:

• Pick a feature
• Randomly shuffle (permute) that feature's values across the dataset
• Recompute the model's predictions
• If the predictions barely change, the feature has low importance; if they change significantly, the feature has a significant influence on the model
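The steps above can be implemented by hand in a few lines. The following sketch (not from the original article) uses a toy dataset and logistic regression, where only the first of two features actually drives the label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)

# Toy data: y depends only on the first feature; the second is pure noise
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
baseline = accuracy_score(y, model.predict(X))

importances = []
for col in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])  # shuffle one column
    score = accuracy_score(y, model.predict(X_perm))
    importances.append(baseline - score)  # drop in accuracy = importance

print(importances)  # large drop for feature 0, near zero for feature 1
```

Shuffling the informative feature destroys the accuracy, while shuffling the noise feature changes almost nothing; that gap is exactly what `PermutationImportance` reports.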

```python
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model_1, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
```

### Partial Dependence Plot

The idea behind a partial dependence plot (PDP) is to hold all other features fixed, vary the feature under analysis, and observe how the prediction changes. An ICE (Individual Conditional Expectation) plot is similar, but shows one curve per instance instead of the average.
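The computation itself is simple enough to sketch by hand (a toy regression example, not the World Cup data): for each grid value of the feature, overwrite that feature in every row, predict, and average.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=300)  # target driven by feature 0

model = RandomForestRegressor(random_state=0).fit(X, y)

# Partial dependence of feature 0: for each grid value, set the feature to
# that value in EVERY row, predict, and average over the dataset
grid = np.linspace(-1, 1, 5)
pdp_curve = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, 0] = v          # other features keep their observed values
    pdp_curve.append(model.predict(X_mod).mean())

print(pdp_curve)  # roughly increases from about -2 to about +2
```

The per-row predictions before averaging are the ICE curves; the PDP is their mean.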

```python
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# PDP for the number of goals scored
pdp_goals = pdp.pdp_isolate(model=my_model_1, dataset=val_X,
                            model_features=feature_names, feature='Goal Scored')
pdp.pdp_plot(pdp_goals, 'Goal Scored', plot_pts_dist=True)
plt.show()

# PDP for distance covered
feature_to_plot = 'Distance Covered (Kms)'
pdp_dist = pdp.pdp_isolate(model=my_model_1, dataset=val_X,
                           model_features=feature_names, feature=feature_to_plot)
pdp.pdp_plot(pdp_dist, feature_to_plot, plot_pts_dist=True)
plt.show()
```

scikit-learn also provides PDP support:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import plot_partial_dependence

plot_partial_dependence(my_model_1, train_X,
                        ['Goal Scored', 'Ball Possession %',
                         'Distance Covered (Kms)', 'Corners', (0, 1)],
                        feature_names=feature_names, grid_resolution=50)

fig = plt.gcf()
fig.set_figheight(10)
fig.set_figwidth(10)
fig.suptitle('Partial dependence')
plt.subplots_adjust(top=0.9, bottom=0.1, wspace=0.8)  # tight_layout causes overlap with suptitle
```

A PDP can analyze at most two features at a time; the `(0, 1)` tuple above produces a two-way plot of the first two features.

### Shapley Value

A PDP usually analyzes a single feature, at most two, and as the two-feature plot shows, even that is no longer easy to read at a glance. Shapley values instead take a single data instance and attribute the prediction to the contribution of every feature.
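Shapley values come from cooperative game theory: each feature is a "player", and its value is its marginal contribution averaged over all orderings of coalitions. The exact computation is exponential in the number of features, but for a toy value function (invented here for illustration, not part of the article) it can be done by brute force:

```python
from itertools import combinations
from math import factorial

# Toy "game": v(S) is the payoff achieved by coalition S of players (features).
# Player 'a' contributes 10, 'b' contributes 20, and 'a' with 'c' together
# earn a 5-point interaction bonus.
def v(S):
    S = frozenset(S)
    total = 0
    if 'a' in S: total += 10
    if 'b' in S: total += 20
    if {'a', 'c'} <= S: total += 5
    return total

players = ['a', 'b', 'c']
n = len(players)

def shapley(player):
    """Exact Shapley value: weighted marginal contribution over all coalitions."""
    others = [p for p in players if p != player]
    val = 0.0
    for r in range(n):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            val += weight * (v(set(S) | {player}) - v(S))
    return val

values = {p: shapley(p) for p in players}
print(values)  # a: 12.5, b: 20.0, c: 2.5 (up to float rounding)
```

Note that the interaction bonus is split evenly between `a` and `c`, and the values sum to the full coalition's payoff of 35; this "efficiency" property is what makes Shapley values an additive explanation of a prediction.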

The instance to explain (row 5 of the validation set, original index 118):

```
Goal Scored                 2
Ball Possession %          38
Attempts                   13
On-Target                   7
Off-Target                  4
Blocked                     2
Corners                     6
Offsides                    1
Free Kicks                 18
Saves                       1
Pass Accuracy %            69
Passes                    399
Distance Covered (Kms)    148
Fouls Committed            25
Yellow Card                 1
Yellow & Red                0
Red                         0
Goals in PSO                3
Name: 118, dtype: int64
```
```python
row_to_show = 5
data_for_prediction = val_X.iloc[row_to_show]  # a single row; multiple rows would also work
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)

pred_1 = my_model_1.predict_proba(data_for_prediction_array)
pred_2 = my_model_2.predict_proba(data_for_prediction_array)
pred_1, pred_2
# (array([[0.3, 0.7]]), array([[0., 1.]]))
```

```python
import shap  # package used to calculate Shapley values

explainer = shap.TreeExplainer(my_model_2)
shap_values = explainer.shap_values(data_for_prediction)

shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)
```

`KernelExplainer` can compute Shapley values for arbitrary models, but it is slower:

```python
k_explainer = shap.KernelExplainer(my_model_1.predict_proba, train_X)
k_shap_values = k_explainer.shap_values(data_for_prediction)

shap.initjs()
shap.force_plot(k_explainer.expected_value[1], k_shap_values[1], data_for_prediction)
```

The summary plot aggregates the analysis over all data points:

```python
shap_values = k_explainer.shap_values(val_X)

shap.initjs()
shap.summary_plot(shap_values[1], val_X)
```

```python
explainer = shap.TreeExplainer(my_model_1)

# Calculate SHAP values for the full dataset
shap_values = explainer.shap_values(X)

# Dependence plot: SHAP value of one feature, colored by an interacting feature
shap.dependence_plot('Ball Possession %', shap_values[1], X, interaction_index='Goal Scored')
```

### LIME

LIME stands for Local Interpretable Model-agnostic Explanations. For a given instance, LIME assumes the model behaves approximately linearly in a local neighborhood, and explains the prediction at that data point with a simple linear surrogate model.
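The local-linear idea can be sketched without the lime library (a toy black-box function, not the World Cup model): perturb the instance, query the black box, weight the samples by proximity, and fit a linear model whose coefficients serve as the explanation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# A nonlinear "black box": we can only query its predictions
def black_box(X):
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2

x0 = np.array([0.5, 0.5])          # the instance to explain

# 1. Sample perturbations around x0
X_local = x0 + rng.normal(scale=0.1, size=(200, 2))
y_local = black_box(X_local)

# 2. Weight samples by proximity to x0 (Gaussian kernel)
dist = np.linalg.norm(X_local - x0, axis=1)
weights = np.exp(-(dist ** 2) / (2 * 0.1 ** 2))

# 3. Fit a weighted linear surrogate: its coefficients are the explanation
surrogate = LinearRegression().fit(X_local, y_local, sample_weight=weights)
print(surrogate.coef_)  # approximately the local gradient of black_box at x0
```

For this function the local gradient at `x0` is roughly `(0.21, 1.0)`, so the surrogate reports the second feature as locally dominant even though the global function is highly nonlinear.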

```python
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(
    train_X.values,  # LimeTabularExplainer expects a numpy array
    feature_names=feature_names,
    class_names=['No', 'Yes'],
    discretize_continuous=False)

train_sample = train_X.sample(n=1)
pred_p_1 = my_model_1.predict_proba(train_sample.values)
pred_p_2 = my_model_2.predict_proba(train_sample.values)
pred_1 = my_model_1.predict(train_sample.values)
pred_2 = my_model_2.predict(train_sample.values)
pred_p_1, pred_1, pred_p_2, pred_2

exp = explainer.explain_instance(train_sample.values[0],
                                 my_model_2.predict_proba,
                                 num_features=len(feature_names),
                                 top_labels=1)
exp.show_in_notebook(show_table=True, show_all=False)
```


### References

• https://github.com/marcotcr/lime
• https://github.com/TeamHG-Memex/eli5
• https://github.com/SauceCat/PDPbox
• https://github.com/AustinRochford/PyCEbox
• https://github.com/slundberg/shap


Photo by Patrick Tomasso on Unsplash