AI (005) - Notes - Clustering Evaluation

Clustering Evaluation and Assessment

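This post summarizes how clustering performance is evaluated. It draws on:

Week 4: (10) 4.10 Evaluating clustering algorithms
"Machine Learning" (the "watermelon book", 西瓜书): Chapter 9 Clustering - 9.2 Performance measures
Wikipedia (en):
- "Cluster analysis"
- "Rand index"
- "Adjusted mutual information"
- "Silhouette (clustering)"
sklearn official documentation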

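1 Notes on clustering evaluation

A clustering is considered good when samples in the same cluster are as similar as possible and samples in different clusters are as different as possible. In other words, the result has high "intra-cluster similarity" and low "inter-cluster similarity".

Clustering evaluation measures fall into two broad categories:

External evaluation: compares the result against a "reference model";
Internal evaluation: examines the clustering result directly, without any reference model.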

For a data set with n elements, $D = \{x_{1}, x_{2}, \dots, x_{n}\}$ :

Let the clustering result be $X = \{X_{1}, X_{2}, \dots, X_{K}\}$
Let the reference result be $Y = \{Y_{1}, Y_{2}, \dots, Y_{L}\}$

Taking the samples in pairs gives:

$a = |SS|, \text{ where } SS = \{(x_{i}, x_{j}) \mid x_{i}, x_{j} \in X_{k};\ x_{i}, x_{j} \in Y_{l}\}$
$b = |SD|, \text{ where } SD = \{(x_{i}, x_{j}) \mid x_{i}, x_{j} \in X_{k};\ x_{i} \in Y_{l_{1}}, x_{j} \in Y_{l_{2}}\}$
$c = |DS|, \text{ where } DS = \{(x_{i}, x_{j}) \mid x_{i} \in X_{k_{1}}, x_{j} \in X_{k_{2}};\ x_{i}, x_{j} \in Y_{l}\}$
$d = |DD|, \text{ where } DD = \{(x_{i}, x_{j}) \mid x_{i} \in X_{k_{1}}, x_{j} \in X_{k_{2}};\ x_{i} \in Y_{l_{1}}, x_{j} \in Y_{l_{2}}\}$

where:

$1 \leq i < j \leq n$ (each unordered pair is counted once)
$k_{1} \neq k_{2}; 1 \leq k, k_{1}, k_{2} \leq K$
$l_{1} \neq l_{2}; 1 \leq l, l_{1}, l_{2} \leq L$

The total number of pairs, that is, the number of sample pairs that can be formed from the data set, is:

$a + b + c + d = \binom{n}{2} = \frac{n (n - 1)}{2}$
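The four pair counts can be obtained by brute force over all sample pairs. The following is a minimal sketch that only uses the Python standard library; the function name `pair_counts` and the toy label lists are illustrative assumptions, not part of the original notes.

```python
from itertools import combinations


def pair_counts(labels_x, labels_y):
    """Return (a, b, c, d), the sizes of SS, SD, DS and DD.

    labels_x[i] is the cluster of sample i in X, labels_y[i] its cluster in Y.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1  # SS: together in X and in Y
        elif same_x:
            b += 1  # SD: together in X, apart in Y
        elif same_y:
            c += 1  # DS: apart in X, together in Y
        else:
            d += 1  # DD: apart in X and in Y
    return a, b, c, d


labels_x = [0, 0, 1, 2]  # toy clustering result X (assumed data)
labels_y = [0, 0, 1, 1]  # toy reference Y (assumed data)
a, b, c, d = pair_counts(labels_x, labels_y)
n = len(labels_x)
print(a, b, c, d)                         # 1 0 1 4
print(a + b + c + d == n * (n - 1) // 2)  # True
```

The loop is quadratic in n, which is fine for checking small examples by hand but not meant for large data sets.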

2 Common external evaluation measures

2.1 Rand Index (RI) and Adjusted Rand Index (ARI)

Rand Index

$RI = \frac{a + d}{\binom{n}{2}} = \frac{2 (a + d)}{n (n - 1)}$

The value lies in $[0, 1]$ , and larger is better. A value of 0 means the two clusterings agree on no pair of samples; a value of 1 means the two clusterings are identical.
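Since a + d is simply the number of pairs on which the two partitions agree, RI can be sketched in a few lines. The function name `rand_index` and the label lists are assumptions made for illustration.

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Fraction of sample pairs on which X and Y agree, i.e. (a + d) / C(n, 2)."""
    pairs = list(combinations(range(len(labels_x)), 2))
    agree = sum(
        (labels_x[i] == labels_x[j]) == (labels_y[i] == labels_y[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Same partition with renamed labels: every pair agrees.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# One reference cluster split in two: 5 of the 6 pairs agree (about 0.833).
print(rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```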

However, RI is not well suited to some clustering settings, which motivated the Adjusted Rand Index.

From the Wikipedia article:

One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern, as does the chance-corrected adjusted Rand index.
Adjusted Rand Index

ARI is the corrected-for-chance version of RI. Its values fall within $[-1, 1]$ rather than $[0, 1]$ , and the score is negative when RI is below its expected value.

$ARI = \frac{RI - E (RI)}{\max (RI) - E (RI)}$

The overlap between X and Y can be summarized in a contingency table, written $[n_{i j}]$ , with $n_{i j} = | X_{i} \cap Y_{j} |$ .
From the Wikipedia article:
- The contingency table
  
  Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_{1}, X_{2}, \dots, X_{r}\}$ and $Y = \{Y_{1}, Y_{2}, \dots, Y_{s}\}$ , the overlap between X and Y can be summarized in a contingency table $[n_{i j}]$ where each entry $n_{i j}$ denotes the number of objects in common between $X_{i}$ and $Y_{j}$ : $n_{i j} = | X_{i} \cap Y_{j} |$ .
  
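  $\begin{array}{c|cccc|c} X \backslash Y & Y_{1} & Y_{2} & \cdots & Y_{s} & \text{Sums} \\ \hline X_{1} & n_{11} & n_{12} & \cdots & n_{1 s} & a_{1} \\ X_{2} & n_{21} & n_{22} & \cdots & n_{2 s} & a_{2} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ X_{r} & n_{r 1} & n_{r 2} & \cdots & n_{r s} & a_{r} \\ \hline \text{Sums} & b_{1} & b_{2} & \cdots & b_{s} & \end{array}$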
- Definition
  
  The adjusted form of the Rand Index, the Adjusted Rand Index, is
  
  $\overbrace{ARI}^{\text{Adjusted Index}} = \frac{\overbrace{\sum_{i j} \binom{n_{i j}}{2}}^{\text{Index}} - \overbrace{[\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2}] / \binom{n}{2}}^{\text{Expected Index}}}{\underbrace{\frac{1}{2} [\sum_{i} \binom{a_{i}}{2} + \sum_{j} \binom{b_{j}}{2}]}_{\text{Max Index}} - \underbrace{[\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2}] / \binom{n}{2}}_{\text{Expected Index}}}$
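The contingency-table form of ARI translates almost term by term into code. This is a sketch using only the standard library (Python 3.8+ for `math.comb`); the function name and the handling of the degenerate case where the denominator is zero are assumptions of this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """ARI computed from the contingency table of two labelings (n >= 2)."""
    n = len(labels_x)
    n_ij = Counter(zip(labels_x, labels_y))  # contingency table entries
    a_i = Counter(labels_x)                  # row sums
    b_j = Counter(labels_y)                  # column sums

    index = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a_i.values())
    sum_b = sum(comb(v, 2) for v in b_j.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case (assumed convention): both partitions are trivial.
        return 1.0
    return (index - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))  # 4/7, about 0.571
print(adjusted_rand_index([0, 1, 2, 3], [0, 0, 0, 0]))  # 0.0
```

For the second call the table has Index = 1, Expected Index = 1 * 2 / 6 and Max Index = 1.5, which gives (1 - 1/3) / (1.5 - 1/3) = 4/7.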

2.2 Adjusted Mutual Information (AMI)

Entropy:

$H (X) = - \sum_{k = 1}^{K} P (k) \log P (k), \text{ where } P (k) = \frac{| X_{k} |}{n}$

$H (Y) = - \sum_{l = 1}^{L} P^{'} (l) \log P^{'} (l), \text{ where } P^{'} (l) = \frac{| Y_{l} |}{n}$
Mutual Information (MI):

$MI (X, Y) = \sum_{k = 1}^{K} \sum_{l = 1}^{L} P (k, l) \log \frac{P (k, l)}{P (k) P^{'} (l)}, \text{ where } P (k, l) = \frac{| X_{k} \cap Y_{l} |}{n}$
Expected MI

Here $a_{k}$ , $b_{l}$ and $n_{k l}$ are the row sums, column sums and entries of the contingency table quoted from Wikipedia in the ARI section.

$E \{MI (X, Y)\} = \sum_{k = 1}^{K} \sum_{l = 1}^{L} \sum_{n_{k l} = (a_{k} + b_{l} - n)^{+}}^{\min (a_{k}, b_{l})} \frac{n_{k l}}{n} \log (\frac{n \cdot n_{k l}}{a_{k} b_{l}}) \times \frac{a_{k}! \, b_{l}! \, (n - a_{k})! \, (n - b_{l})!}{n! \, n_{k l}! \, (a_{k} - n_{k l})! \, (b_{l} - n_{k l})! \, (n - a_{k} - b_{l} + n_{k l})!}$
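Adjusted Mutual Information (AMI)

$AMI (X, Y) = \frac{MI - E (MI)}{\max (H (X), H (Y)) - E (MI)}$

The value is at most 1. As with ARI, two identical clusterings score 1 and two independent clusterings score about 0; because of the chance correction, slightly negative values are possible.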
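The entropy, MI, expected MI and AMI formulas can be checked numerically on small label lists. The sketch below follows the formulas with the max-entropy normalizer and uses only the standard library; the function names are assumptions, and the inner sum starts at max(1, a_k + b_l - n) because the n_kl = 0 term contributes nothing.

```python
from collections import Counter
from math import factorial, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_info(labels_x, labels_y):
    n = len(labels_x)
    a, b = Counter(labels_x), Counter(labels_y)
    # Only non-empty cells of the contingency table contribute.
    return sum(
        n_kl / n * log(n * n_kl / (a[k] * b[l]))
        for (k, l), n_kl in Counter(zip(labels_x, labels_y)).items()
    )


def expected_mutual_info(labels_x, labels_y):
    n = len(labels_x)
    emi = 0.0
    for a_k in Counter(labels_x).values():
        for b_l in Counter(labels_y).values():
            for n_kl in range(max(1, a_k + b_l - n), min(a_k, b_l) + 1):
                # Hypergeometric probability of seeing n_kl in this cell.
                prob = (
                    factorial(a_k) * factorial(b_l)
                    * factorial(n - a_k) * factorial(n - b_l)
                ) / (
                    factorial(n) * factorial(n_kl) * factorial(a_k - n_kl)
                    * factorial(b_l - n_kl) * factorial(n - a_k - b_l + n_kl)
                )
                emi += n_kl / n * log(n * n_kl / (a_k * b_l)) * prob
    return emi


def adjusted_mutual_info(labels_x, labels_y):
    mi = mutual_info(labels_x, labels_y)
    emi = expected_mutual_info(labels_x, labels_y)
    denom = max(entropy(labels_x), entropy(labels_y)) - emi
    return 1.0 if denom == 0 else (mi - emi) / denom


print(round(adjusted_mutual_info([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # 1.0
print(round(adjusted_mutual_info([0, 0, 1, 1], [0, 1, 0, 1]), 6))  # -0.5
```

In the second call MI is 0 while the expected MI is log(2)/3, so the score is negative. Note that library implementations may normalize differently (for example with the arithmetic mean of the two entropies), so values can differ when H(X) and H(Y) are not equal.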

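2.3 Homogeneity, Completeness and V-measure

Homogeneity: each cluster contains only samples of a single class. With X the clustering result and Y the reference classes, as in section 1:

$h = 1 - \frac{H (Y | X)}{H (Y)}$
- where $H (Y)$ is the entropy of the reference classes Y
- $H (Y | X)$ is the conditional entropy of the classes Y given the cluster assignment X:
  
  $H (Y | X) = \sum_{k = 1}^{K} \sum_{l = 1}^{L} P (k, l) \log \frac{P (k)}{P (k, l)} = - \sum_{k = 1}^{K} \sum_{l = 1}^{L} \frac{n_{k l}}{n} \log \frac{n_{k l}}{a_{k}}$ , with $n_{k l} = | X_{k} \cap Y_{l} |$ and $a_{k} = | X_{k} |$
Completeness: all samples of the same class are assigned to the same cluster.

$c = 1 - \frac{H (X | Y)}{H (X)}$
V-measure: the harmonic mean of Homogeneity and Completeness.

$v = 2 \frac{h \times c}{h + c}$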
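A small numerical check helps keep the two conditional entropies apart. In this sketch `labels_x` is the clustering result and `labels_y` the reference classes; homogeneity is one minus the conditional entropy of the classes given the clusters divided by the class entropy, and completeness is the mirror image. The function names and the conventions for zero entropy are assumptions of the example.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) = -sum (n_kl / n) * log(n_kl / size of the 'given' group)."""
    n = len(target)
    given_sizes = Counter(given)
    joint = Counter(zip(target, given))
    return -sum(
        n_kl / n * log(n_kl / given_sizes[g]) for (_, g), n_kl in joint.items()
    )


def homogeneity_completeness_v(labels_x, labels_y):
    """labels_x: clustering result X; labels_y: reference classes Y."""
    h_x, h_y = entropy(labels_x), entropy(labels_y)
    # Assumed convention: a labeling with zero entropy gives a score of 1.
    h = 1.0 if h_y == 0 else 1.0 - conditional_entropy(labels_y, labels_x) / h_y
    c = 1.0 if h_x == 0 else 1.0 - conditional_entropy(labels_x, labels_y) / h_x
    v = 0.0 if h + c == 0 else 2 * h * c / (h + c)
    return h, c, v


# Every cluster is pure (h = 1), but class 1 is split over two clusters (c = 2/3).
h, c, v = homogeneity_completeness_v([0, 0, 1, 2], [0, 0, 1, 1])
print(round(h, 6), round(c, 6), round(v, 6))  # 1.0 0.666667 0.8
```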

2.4 Fowlkes-Mallows index (FMI)

FMI is the geometric mean of the pairwise precision and recall:

$FMI = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$
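The pair counts in FMI do not require enumerating pairs: a is the number of pairs that fall in the same contingency cell, a + b the number of pairs in the same cluster of X, and a + c the number of pairs in the same cluster of Y. A standard-library sketch (Python 3.8+), with an assumed function name and an assumed return value of 0 when one partition has no co-clustered pair:

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows(labels_x, labels_y):
    # a: pairs placed together in both X and Y
    a = sum(comb(v, 2) for v in Counter(zip(labels_x, labels_y)).values())
    # a + b: pairs together in X; a + c: pairs together in Y
    same_x = sum(comb(v, 2) for v in Counter(labels_x).values())
    same_y = sum(comb(v, 2) for v in Counter(labels_y).values())
    if same_x == 0 or same_y == 0:
        return 0.0  # assumed convention when a ratio is undefined
    return sqrt((a / same_x) * (a / same_y))


print(fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(fowlkes_mallows([0, 0, 1, 2], [0, 0, 1, 1]))  # sqrt(1/2), about 0.707
```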

2.5 Other external evaluation measures

Jaccard Coefficient (JC)

Also known as the Jaccard Index.

$JC = \frac{a}{a + b + c}$
Dice Index (DI)

$DI = \frac{2 a}{2 a + b + c}$
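Both indices depend only on the pair counts a, b and c, so a worked example needs no data at all. The counts below are assumed toy values (a = 1, b = 0, c = 1); the function names are illustrative.

```python
def jaccard_coefficient(a, b, c):
    """a / (a + b + c): the pair count d is ignored."""
    return a / (a + b + c)


def dice_index(a, b, c):
    """2a / (2a + b + c): like Jaccard, but agreeing pairs are counted twice."""
    return 2 * a / (2 * a + b + c)


a, b, c = 1, 0, 1  # assumed toy pair counts
jc = jaccard_coefficient(a, b, c)
di = dice_index(a, b, c)
print(jc)  # 0.5
print(di)  # 2/3, about 0.667
# The two are monotonically related: DI = 2 * JC / (1 + JC).
print(abs(di - 2 * jc / (1 + jc)) < 1e-12)  # True
```

Because of that monotone relation, the two indices always rank clusterings in the same order.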

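3 Common internal evaluation measures

3.1 Silhouette coefficient

The silhouette coefficient applies when the true class labels are unknown. For a sample point i, let:

$a (i)$ : the mean distance from i to all other samples in its own cluster
$b (i)$ : the smallest, over all other clusters, of the mean distance from i to the samples of that cluster

Then the silhouette coefficient of sample i is:

$s (i) = \frac{b (i) - a (i)}{\max \{a (i), b (i)\}} \quad \text{or} \quad s (i) = \begin{cases} 1 - \frac{a (i)}{b (i)} & \text{if } a (i) < b (i) \\ 0 & \text{if } a (i) = b (i) \\ \frac{b (i)}{a (i)} - 1 & \text{if } a (i) > b (i) \end{cases}$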

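So the range of s(i) is:

$-1 \leq s (i) \leq 1$

When $a (i) \ll b (i)$ , s(i) approaches 1, meaning sample i is well clustered;
When $a (i) \gg b (i)$ , s(i) approaches -1, meaning sample i would fit better in the neighbouring cluster;
When $a (i) \approx b (i)$ , s(i) is close to 0, meaning the sample lies on the boundary between two clusters.

The mean silhouette value is:

$\bar{s} = \frac{1}{n} \sum_{i = 1}^{n} s (i)$

As a rule of thumb, $\bar{s} > 0.5$ indicates an appropriate clustering;
$\bar{s} < 0.2$ indicates that the data shows no real cluster structure.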
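The per-sample definition can be written out directly. The sketch below assumes Euclidean distance (`math.dist`, Python 3.8+), at least two clusters, and the convention that a sample alone in its cluster gets s(i) = 0; the function name and the toy points are illustrative.

```python
from math import dist


def silhouette_values(points, labels):
    """Return s(i) for every sample, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # dist(p, p) = 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a_i = sum(dist(p, q) for q in own) / (len(own) - 1)
        b_i = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a_i, b_i)
        values.append(0.0 if denom == 0 else (b_i - a_i) / denom)
    return values


points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # two tight, well separated groups
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
print([round(v, 3) for v in s])  # [0.9, 0.9, 0.9, 0.9]
print(round(sum(s) / len(s), 3))  # 0.9
```

Here a(i) = 1 and b(i) is about 10.02 for every sample, so each s(i) is about 1 - 1/10.02, close to 1 as expected for well separated clusters.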

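3.2 Calinski-Harabasz (CH)

CH also applies when the true class labels are unknown. Taking K-means as an example, for a given number of clusters K:

The within-cluster dispersion is:

$W (K) = \sum_{k = 1}^{K} \sum_{C (j) = k} \| x_{j} - \bar{x}_{k} \|^{2}$

The between-cluster dispersion is:

$B (K) = \sum_{k = 1}^{K} a_{k} \| \bar{x}_{k} - \bar{x} \|^{2}$

where $\bar{x}_{k}$ is the centroid of cluster k, $\bar{x}$ is the mean of all samples, and $a_{k}$ is the number of samples in cluster k.

Then CH is:

$CH (K) = \frac{B (K) (N - K)}{W (K) (K - 1)}$

where N is the number of samples.

Relatively speaking, CH may be faster to compute.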
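The CH score can be reproduced by hand on a tiny data set. This is a standard-library sketch with an assumed function name; it needs at least two clusters and a non-zero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """CH = B (N - K) / (W (K - 1)) with squared Euclidean distances."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    def mean(ps):
        return [sum(p[t] for p in ps) / len(ps) for t in range(dim)]

    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    overall = mean(points)
    w = b = 0.0
    for members in clusters.values():
        center = mean(members)
        w += sum(sq_dist(p, center) for p in members)   # within-cluster part
        b += len(members) * sq_dist(center, overall)    # between-cluster part
    return b * (n - k) / (w * (k - 1))


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
# W = 4 * 0.25 = 1, B = 2 * 25 + 2 * 25 = 100, so CH = 100 * 2 / (1 * 1).
print(calinski_harabasz(points, labels))  # 200.0
```

Only one pass over the samples is needed once the centroids are known, whereas the silhouette coefficient needs all pairwise distances.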

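3.3 Other internal evaluation measures

Davies-Bouldin Index (DBI)

Let:
- $\sigma_{i}$ : the mean distance from the samples of cluster i to its centroid;
- $c_{i}$ : the centroid of cluster i;
- $d (c_{i}, c_{j})$ : the distance between the centroids $c_{i}$ and $c_{j}$ .
Then, for K clusters:

$DB = \frac{1}{K} \sum_{i = 1}^{K} \max_{j \neq i} (\frac{\sigma_{i} + \sigma_{j}}{d (c_{i}, c_{j})})$

A smaller DBI is better.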
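A direct implementation of the DBI makes the roles of the cluster scatter and the centroid distance explicit. The sketch assumes Euclidean distance (`math.dist`, Python 3.8+), at least two clusters with distinct centroids, and an illustrative function name.

```python
from math import dist


def davies_bouldin(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Centroid of each cluster (coordinate-wise mean).
    centers = {
        lab: tuple(sum(coord) / len(ms) for coord in zip(*ms))
        for lab, ms in clusters.items()
    }
    # sigma: mean distance from the samples of a cluster to its centroid.
    sigma = {
        lab: sum(dist(p, centers[lab]) for p in ms) / len(ms)
        for lab, ms in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (sigma[i] + sigma[j]) / dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
# sigma = 0.5 for both clusters, centroid distance = 10, so DB = (0.5 + 0.5) / 10.
print(davies_bouldin(points, labels))  # 0.1
```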
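Dunn Index (DI)

Let:
- $d (i, j)$ : the distance between clusters i and j;
- $d^{'} (k)$ : the largest distance between two samples within cluster k.
Then, for K clusters:

$D = \frac{\min_{1 \leq i < j \leq K} d (i, j)}{\max_{1 \leq k \leq K} d^{'} (k)}$

A larger DI is better.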
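The Dunn index leaves the choice of inter-cluster distance open. The sketch below assumes one common choice, the smallest Euclidean distance between a sample of one cluster and a sample of another, and takes the cluster diameter as the intra-cluster distance. It also assumes at least two clusters and at least one cluster with two distinct samples; the function name is illustrative.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # d(i, j): closest pair of samples taken from two different clusters
    # (an assumed choice of inter-cluster distance).
    min_between = min(
        dist(p, q)
        for members_i, members_j in combinations(clusters.values(), 2)
        for p in members_i
        for q in members_j
    )
    # d'(k): diameter of a cluster, i.e. its farthest pair of samples.
    max_within = max(
        dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )
    return min_between / max_within


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))  # 10.0 (closest gap 10, largest diameter 1)
```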

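4 Evaluation functions in sklearn

4.1 How to import

You can import all evaluation functions at once:

# Imports every metric, not only the clustering ones
from sklearn import metrics

You can also import only the evaluation functions you need: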

# External evaluation functions
# For all of these, larger values are better

# 1 Cannot be passed by name as a string

## Homogeneity, Completeness and V-measure
from sklearn.metrics import homogeneity_completeness_v_measure

# 2 Can be passed by name as a string, e.g. as the scoring argument of GridSearchCV and cross_val_score
#from sklearn.model_selection import GridSearchCV
#from sklearn.model_selection import cross_val_score

## ARI
from sklearn.metrics import adjusted_rand_score

## AMI
from sklearn.metrics import adjusted_mutual_info_score

## Homogeneity
from sklearn.metrics import homogeneity_score

## Completeness
from sklearn.metrics import completeness_score

## V-measure
from sklearn.metrics import v_measure_score

## FMI
from sklearn.metrics import fowlkes_mallows_score

## NMI (normalized mutual information; not covered above)
from sklearn.metrics import normalized_mutual_info_score

# Internal evaluation functions

## Silhouette coefficient (per sample)
from sklearn.metrics import silhouette_samples

## Mean Silhouette coefficient
from sklearn.metrics import silhouette_score

## CH (named calinski_harabaz_score in older sklearn releases)
from sklearn.metrics import calinski_harabasz_score

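4.2 How to use

External measures take at least two arguments, (labels_true, labels_pred): the ground-truth labels and the predicted labels.

# External measures, using ARI as the example; the others work the same way with the function name swapped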
from sklearn.metrics import adjusted_rand_score

ARI_1 = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
ARI_2 = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print("(ARI_1, ARI_2):", (ARI_1, ARI_2))
## (ARI_1, ARI_2): (1.0, 1.0)

ARI_3 = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
print("ARI_3:", ARI_3)
## ARI_3: 0.5714285714285715

ARI_4 = adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])
print("ARI_4:", ARI_4)
## ARI_4: 0.0

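Internal measures take at least two arguments, (X, labels).

# Internal measures, using K-means as the example.
from sklearn.datasets import load_digits
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

train, target = load_digits(return_X_y=True)

mb_kmeans = MiniBatchKMeans()
mb_kmeans.fit(train)
labels = mb_kmeans.predict(train)  # cluster assignment of every sample

ch = calinski_harabasz_score(train, labels)
ms = silhouette_score(train, labels)

print("CH score:", ch)
print("Mean Silhouette score:", ms)
## Output from one run (values vary between runs because K-means starts from random centers):
## CH score: 173.19418065754013
## Mean Silhouette score: 0.1696213860110404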