AI(005) - 笔记 - 聚类性能评估(Clustering Evaluation)

论坛 期权论坛 脚本     
匿名技术用户   2020-12-21 12:42   252   0

聚类性能评估(Clustering Evaluation and Assessment)

这篇文章是对聚类性能评估的总结,对应:

  • 第四周:(10)4.10 聚类算法评估
  • 《机器学习》(西瓜书):第9章 聚类 - 9.2 性能度量
  • 维基百科(en):
    • “Cluster analysis” 词条
    • “Rand index”词条
    • “Adjusted mutual information”词条
    • “Silhouette (clustering)”词条
  • sklearn官方文档


1 聚类性能评估的一些说明

说到聚类性能比较好,就是说同一簇的样本尽可能的相似,不同簇的样本尽可能不同,即是说聚类结果“簇内相似度”(intra-cluster similarity)高,而“簇间相似度”(inter-cluster similarity)低。

聚类性能的评估(度量)分为两大类:

  • 外部评估(external evaluation):将结果与某个“参考模型”(reference model)进行比较;
  • 内部评估(internal evaluation):直接考虑聚类结果而不利用任何参考模型。

对有n个元素的数据集 D={x1,x2,⋯,xn}" id="MathJax-Element-1-Frame" role="presentation" style="position: relative;" tabindex="0">D={x1,x2,,xn}

  • 假定聚类结果: X={X1,X2,⋯,XK}" id="MathJax-Element-2-Frame" role="presentation" style="position: relative;" tabindex="0">X={X1,X2,,XK}
  • 假定参考结果: Y={Y1,Y2,⋯,YL}" id="MathJax-Element-3-Frame" role="presentation" style="position: relative;" tabindex="0">Y={Y1,Y2,,YL}

那么将样本两两配对得:

  • a=|SS|,whereSS={(xi,xj)|xi,xj∈Xk;xi,xj∈Yl}" id="MathJax-Element-4-Frame" role="presentation" style="position: relative;" tabindex="0">a=|SS|,whereSS={(xi,xj)|xi,xjXk;xi,xjYl}
  • b=|SD|,whereSD={(xi,xj)|xi,xj∈Xk;xi∈Yl1,xj∈Yl2}" id="MathJax-Element-5-Frame" role="presentation" style="position: relative;" tabindex="0">b=|SD|,whereSD={(xi,xj)|xi,xjXk;xiYl1,xjYl2}
  • c=|DS|,whereDS={(xi,xj)|xi∈Xk1,xj∈Xk2;xi,xj∈Yl}" id="MathJax-Element-6-Frame" role="presentation" style="position: relative;" tabindex="0">c=|DS|,whereDS={(xi,xj)|xiXk1,xjXk2;xi,xjYl}
  • d=|DD|,whereDD={(xi,xj)|xi∈Xk1,xj∈Xk2;xi∈Yl1,xj∈Yl2}" id="MathJax-Element-7-Frame" role="presentation" style="position: relative;" tabindex="0">d=|DD|,whereDD={(xi,xj)|xiXk1,xjXk2;xiYl1,xjYl2}

其中:

  • i≠j;1≤i,j≤n" id="MathJax-Element-8-Frame" role="presentation" style="position: relative;" tabindex="0">ij;1i,jn
  • k1≠k2;1≤k,k1,k2≤K" id="MathJax-Element-9-Frame" role="presentation" style="position: relative;" tabindex="0">k1k2;1k,k1,k2K
  • l1≠l2;1≤l,l1,l2≤L" id="MathJax-Element-10-Frame" role="presentation" style="position: relative;" tabindex="0">l1l2;1l,l1,l2L

那么所有配对的总数,即集合中可以组成样本对的对数为:

a+b+c+d=(n2)=n(n−1)2" id="MathJax-Element-11-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">a+b+c+d=(n2)=n(n1)2


2 常用外部评估(external evaluation)

2.1 Rand Index(RI) and Adjust Rand Index(ARI)

  • Rand Index

    RI=a+d(n2)=2(a+d)n(n−1)" id="MathJax-Element-63-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">RI=a+d(n2)=2(a+d)n(n1)

    显然,结果值在 [0,1]" id="MathJax-Element-64-Frame" role="presentation" style="position: relative;" tabindex="0">[0,1] 之间,且值越大越好。当为0时,两个聚类无重叠;当为1时,两个聚类完全重叠。

    但在某些聚类情况可能并不适用,从而产生了 Adjust Rand Index。

    Wiki中的原文:

    One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern, as does the chance-corrected adjusted Rand index.

  • Adjust Rand Index

    ARI让RI有了修正机会(corrected-for-chance),在取值上从0到1变成了 [−1,1]" id="MathJax-Element-65-Frame" role="presentation" style="position: relative;" tabindex="0">[1,1],包含了负数(当RI小于期望值)。

    ARI=RI−E(RI)max(RI)−E(RI)" id="MathJax-Element-66-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">ARI=RIE(RI)max(RI)E(RI)

    对于X与Y的重叠可以用一个列联表(contingency table)表示,记作 [nij]" id="MathJax-Element-67-Frame" role="presentation" style="position: relative;" tabindex="0">[nij]nij=|Xi⋂Yj|" id="MathJax-Element-68-Frame" role="presentation" style="position: relative;" tabindex="0">nij=|XiYj|

    Wiki中的原文:

    • The contingency table

      Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these elements, namely X={X1,X2,…,Xr}" id="MathJax-Element-69-Frame" role="presentation" style="position: relative;" tabindex="0">X={X1,X2,,Xr} and Y={Y1,Y2,…,Ys}" id="MathJax-Element-70-Frame" role="presentation" style="position: relative;" tabindex="0">Y={Y1,Y2,,Ys}, the overlap between X and Y can be summarized in a contingency table [nij]" id="MathJax-Element-71-Frame" role="presentation" style="position: relative;" tabindex="0">[nij] where each entry nij" id="MathJax-Element-72-Frame" role="presentation" style="position: relative;" tabindex="0">nij denotes the number of objects in common between Xi" id="MathJax-Element-73-Frame" role="presentation" style="position: relative;" tabindex="0">Xi and Yj" id="MathJax-Element-74-Frame" role="presentation" style="position: relative;" tabindex="0">Yj : nij=|Xi⋂Yj|" id="MathJax-Element-75-Frame" role="presentation" style="position: relative;" tabindex="0">nij=|XiYj|.

      X∖YY1Y2⋯YsSumsX1n11n12⋯n1sa1X2n21n22⋯n2sa2⋮⋮⋮⋱⋮⋮Xrnr1nr2⋯nrsarSumsb1b2⋯bs" id="MathJax-Element-76-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">XYY1Y2YsSumsX1n11n12n1sa1X2n21n22n2sa2Xrnr1nr2nrsarSumsb1b2bs

    • Definition

      The adjusted form of the Rand Index, the Adjusted Rand Index, is

      ARI⏞Adjust Index=∑ij(nij2)⏞Index−[∑i(ai2)∑j(bj2)]/(n2)⏞Expected Index12[∑i(ai2)+∑j(bj2)]⏟Max Index−[∑i(ai2)∑j(bj2)]/(n2)⏟Expected Index" id="MathJax-Element-77-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">ARIAdjust Index=ij(nij2)Index[i(ai2)j(bj2)]/(n2)Expected Index12[i(ai2)+j(bj2)]Max Index[i(ai2)j(bj2)]/(n2)Expected Index

2.2 Adjusted Mutual Information(AMI)

  • Entropy(熵):

    H(X)=−∑k=1KP(k)log⁡P(k)whereP(k)=|Xk|n" id="MathJax-Element-27-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">H(X)=k=1KP(k)logP(k)whereP(k)=|Xk|n

    H(Y)=−∑l=1LP′(l)log⁡P′(k)whereP(l)=|Yl|n" id="MathJax-Element-28-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">H(Y)=l=1LP(l)logP(k)whereP(l)=|Yl|n

  • Mutual Information(MI)(互信息):

    MI(X,Y)=∑k=1K∑l=1LP(k,l)log⁡P(k,l)P(k)P′(l)whereP(k,l)=|Xk⋂Yl|n" id="MathJax-Element-29-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">MI(X,Y)=k=1Kl=1LP(k,l)logP(k,l)P(k)P(l)whereP(k,l)=|XkYl|n

  • MI的期望

    其中这里的 a,b,nkl" id="MathJax-Element-30-Frame" role="presentation" style="position: relative;" tabindex="0">a,b,nkl 参数,参照ARI中的Wiki原文中的矩阵。

    E{MI(X,Y)}=∑k=1K∑l=1L∑nkl=(ak+bl−n)+min(ak,bl)nklnlog⁡(n⋅nklakbl)×ak!bl!(n−ak)!(n−bl)!n!nkl!(ak−nkl)!(bl−nkl)!(n−ak−l+nkl)!" id="MathJax-Element-31-Frame" role="presentation" style="position: relative;" tabindex="0">E{MI(X,Y)}=k=1Kl=1Lnkl=(ak+bln)+min(ak,bl)nklnlog(nnklakbl)×ak!bl!(nak)!(nbl)!n!nkl!(aknkl)!(blnkl)!(nakl+nkl)!

  • Adjusted Mutual Information(AMI)(调整互信息)

    AMI(X,Y)=MI−E(MI)max(H(X),H(Y))−E(MI)" id="MathJax-Element-32-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">AMI(X,Y)=MIE(MI)max(H(X),H(Y))E(MI)

    取值范围为 [0,1]" id="MathJax-Element-33-Frame" role="presentation" style="position: relative;" tabindex="0">[0,1],同样的,两个独立聚类值为0,两种完全相同的聚类值为1。

2.3 Homogeneity,Completeness and V-measure

  • Homogeneity(同质性):一个簇是只包含一个类别的样本

    h=1−H(X|Y)H(X)" id="MathJax-Element-34-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">h=1H(X|Y)H(X)

    • 其中 H(X)" id="MathJax-Element-35-Frame" role="presentation" style="position: relative;" tabindex="0">H(X) 是聚类X的熵
    • H(X|Y)" id="MathJax-Element-36-Frame" role="presentation" style="position: relative;" tabindex="0">H(X|Y) 是给定簇分配Y条件下的X的熵:

      H(X|Y)=∑k=1K∑l=1LP(Xk,Yl)log⁡P(Yl)P(Xk,Yl)=∑k=1K∑l=1Lnklnlog⁡nkln" id="MathJax-Element-37-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">H(X|Y)=k=1Kl=1LP(Xk,Yl)logP(Yl)P(Xk,Yl)=k=1Kl=1Lnklnlognkln

  • Completeness(完整性):同类别样本被归类到相同簇中

    c=1−H(Y|X)H(Y)" id="MathJax-Element-38-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">c=1H(Y|X)H(Y)

  • V-measure:Homogeneity 和 Completeness 的调和平均

    v=2⋅h×ch+c" id="MathJax-Element-39-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">v=2h×ch+c

2.4 Fowlkes-Mallows index(FMI)

FMI是成对精度和召回率的几何均值

FMI=aa+b⋅aa+c" id="MathJax-Element-40-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">FMI=aa+baa+c

2.5 其它外部评估方法(others)

  • Jaccard Coefficient(JC)

    又称 Jaccard Index。

    J=aa+b+c" id="MathJax-Element-41-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">J=aa+b+c

  • Dice Index(DI)

    J=2a2a+b+c" id="MathJax-Element-42-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">J=2a2a+b+c


3 常用的内部评估(internal evaluation)

3.1 Silhouette coefficient(轮廓系数)

轮廓系数(侧影法)适用于实际类别信息未知的情况。对其中一个样本点i,记:

  • a(i)" id="MathJax-Element-43-Frame" role="presentation" style="position: relative;" tabindex="0">a(i):本簇中到其它所有样本点的距离的平均
  • b(i)" id="MathJax-Element-44-Frame" role="presentation" style="position: relative;" tabindex="0">b(i):到其它簇的所有样本点的平均距离的最小值

则样本点i的轮廓系数为:

s(i)=b(i)&#x2212;a(i)max{a(i),b(i)}ors(i)={1&#x2212;a(i)b(i)ifa(i)&lt;b(i)0ifa(i)=b(i)b(i)a(i)&#x2212;1ifa(i)&gt;b(i)" id="MathJax-Element-45-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">s(i)=b(i)a(i)max{a(i),b(i)}ors(i)={1a(i)b(i)ifa(i)<b(i)0ifa(i)=b(i)b(i)a(i)1ifa(i)>b(i)

所以最终s(i)的取值:

&#x2212;1&#x2264;s(i)&#x2264;1" id="MathJax-Element-46-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">1s(i)1

  • a(i)&#x226A;b(i)" id="MathJax-Element-47-Frame" role="presentation" style="position: relative;" tabindex="0">a(i)b(i) 时,无限接近于1,则意味着聚类合适;
  • a(i)&#x226B;b(i)" id="MathJax-Element-48-Frame" role="presentation" style="position: relative;" tabindex="0">a(i)b(i) 时,无限接近于-1,则意味着把样本i聚类到相邻簇中更合适;
  • a(i)&#x224A;b(i)" id="MathJax-Element-49-Frame" role="presentation" style="position: relative;" tabindex="0">a(i)b(i) 时,无限接近于0,则意味着样本在两个簇交集处。

平均Silhouette值为:

s&#x00AF;=1n&#x2211;i=1ns(i)" id="MathJax-Element-50-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">s=1ni=1ns(i)

  • s&#x00AF;&gt;0.5" id="MathJax-Element-51-Frame" role="presentation" style="position: relative;" tabindex="0">s>0.5 时,表明聚类合适;
  • s&#x00AF;&lt;0.2" id="MathJax-Element-52-Frame" role="presentation" style="position: relative;" tabindex="0">s<0.2 时,表明数据不存在聚类特征。

3.2 Calinski-Harabaz(CH)

CH也适用于实际类别信息未知的情况,以下以K-means为例,给定聚类数目K,则:

类内散度为:

W(K)=&#x2211;k=1K&#x2211;C(j)=k||xj&#x2212;x&#x00AF;k||2" id="MathJax-Element-53-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">W(K)=k=1KC(j)=k||xjxk||2

类间散度:

B(K)=&#x2211;k=1Kak||x&#x00AF;k&#x2212;x&#x00AF;||2" id="MathJax-Element-54-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">B(K)=k=1Kak||xkx||2

则CH为:

CH(K)=B(K)(N&#x2212;K)W(K)(K&#x2212;1)" id="MathJax-Element-55-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">CH(K)=B(K)(NK)W(K)(K1)

CH相对来说速度可能会更快。

3.3 其它内部评估方法(others)

  • Davies-Bouldin Index(DBI)

    记:

    • &#x03C3;i" id="MathJax-Element-56-Frame" role="presentation" style="position: relative;" tabindex="0">σi :本簇中到其它所有样本点的距离的平均;
    • ci" id="MathJax-Element-57-Frame" role="presentation" style="position: relative;" tabindex="0">ci :簇的中心;
    • d(ci,cj)" id="MathJax-Element-58-Frame" role="presentation" style="position: relative;" tabindex="0">d(ci,cj) :样本间距。

    则:

    DB=1n&#x2211;i=1nmaxj&#x2260;i(&#xA0;&#x03C3;i+&#x03C3;jd(ci,cj))" id="MathJax-Element-59-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">DB=1ni=1nmaxji( σi+σjd(ci,cj))

    DBI越小越好。

  • Dunn Index(DI)

    记:

    • d(i,j)" id="MathJax-Element-60-Frame" role="presentation" style="position: relative;" tabindex="0">d(i,j) :样本间距;
    • d&#x2032;(k)" id="MathJax-Element-61-Frame" role="presentation" style="position: relative;" tabindex="0">d(k) :本簇内样本对间的最远距离

    则:

    D=min1&#x2264;i&lt;j&#x2264;nd(i,j)max1&#x2264;k&#x2264;nd&#x2032;(k)" id="MathJax-Element-62-Frame" role="presentation" style="text-align: center; position: relative;" tabindex="0">D=min1i<jnd(i,j)max1knd(k)

    DI越大越好。


4 sklearn中的评估函数

4.1 如何导入

你可以一次性把所有评估函数导入进来:

# 导入所有评估函数,不止是聚类的
from sklearn import metrics

你也可以只导入想要使用的评估函数:

# 外部评估函数
# 这些评估都是,数值越大越好

# 1 不能以字符串形式作为参数

## Homogeneity,Completeness and V-measure
from sklearn.metrics import homogeneity_completeness_v_measure

# 2 可以以字符串形式的作为参数,例如 GridSearchCV 和 cross_value_score 的 scoring 参数
#from sklearn.model_selection import GridSearchCV
#from sklearn.model_selection import cross_val_score

## ARI
from sklearn.metrics import adjusted_rand_score

## AMI
from sklearn.metrics import adjusted_mutual_info_score

## Homogeneity
from sklearn.metrics import homogeneity_score

## Completeness 
from sklearn.metrics import completeness_score

## V-measure
from sklearn.metrics import v_measure_score

## FMI
from sklearn.metrics import fowlkes_mallows_score

## NMI(前文没有提,标准化互信息聚类)
from sklearn.metrics import normalized_mutual_info_score
# 内部评估函数

## Silhouette coefficient
from sklearn.metrics import silhouette_samples

## Mean Silhouette coefficient
from sklearn.metrics import silhouette_score

## CH
from sklearn.metrics import calinski_harabaz_score

4.2 如何使用

外部评价参数,需要至少两个参数(labels_true, labels_pred)。真值和预测值。

# 外部评价方法以ARI为例,其它方法一样。只是替换函数名称即可
from sklearn.metrics import adjusted_rand_score

ARI_1 = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
ARI_2 = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(("(ARI_1, ARI_2):", (ARI_1, ARI_2))
## (ARI_1, ARI_2): (1.0, 1.0)

ARI_3 = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
print("ARI_3:", ARI_3)
## ARI_3: 0.5714285714285715

ARI_4 = adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])
print("ARI_4:", ARI_4)
## ARI_4: 0.0

内部评价参数,需要至少两个参数(X, labels)

# 内部评价方法以 K-means 为例。
from sklearn.datasets import load_digits
from sklearn.cluster import MiniBatchKMeans

train, target = load_digits(return_X_y = True)

mb_kmeans = MiniBatchKMeans()
mb_kmeans.fit(train)

ch = calinski_harabaz_score(train, mb_kmeans.predict(train))
ms = silhouette_score(train, mb_kmeans.predict(train))

print("CH score:", ch)
print("Mean Silhouette score:", ms)
## CH score: 173.19418065754013
## Mean Silhouette score: 0.1696213860110404
分享到 :
0 人收藏
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

积分:7942463
帖子:1588486
精华:0
期权论坛 期权论坛
发布
内容

下载期权论坛手机APP