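Clustering Evaluation and Assessment
This article summarizes how clustering performance is evaluated. It draws on:
Week 4: (10) 4.10 Evaluating clustering algorithms
"Machine Learning" (the "watermelon book"): Chapter 9 Clustering, 9.2 Performance Measures
Wikipedia (en):
"Cluster analysis" entry
"Rand index" entry
"Adjusted mutual information" entry
"Silhouette (clustering)" entry
The official sklearn documentation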
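1 Notes on clustering evaluation
A clustering is good when samples in the same cluster are as similar as possible and samples in different clusters are as different as possible; that is, the result has high intra-cluster similarity and low inter-cluster similarity.
Clustering evaluation measures fall into two broad categories:
External evaluation: compare the result against a "reference model";
Internal evaluation: assess the clustering result directly, without using any reference model.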
For a data set with n elements, $D = \{x_1, x_2, \cdots, x_n\}$:
Let the clustering result be $X = \{X_1, X_2, \cdots, X_K\}$
Let the reference result be $Y = \{Y_1, Y_2, \cdots, Y_L\}$
Pairing up the samples two at a time then gives:
$a = |SS|$, where $SS = \{(x_i, x_j) \mid x_i, x_j \in X_k;\ x_i, x_j \in Y_l\}$
$b = |SD|$, where $SD = \{(x_i, x_j) \mid x_i, x_j \in X_k;\ x_i \in Y_{l_1}, x_j \in Y_{l_2}\}$
$c = |DS|$, where $DS = \{(x_i, x_j) \mid x_i \in X_{k_1}, x_j \in X_{k_2};\ x_i, x_j \in Y_l\}$
$d = |DD|$, where $DD = \{(x_i, x_j) \mid x_i \in X_{k_1}, x_j \in X_{k_2};\ x_i \in Y_{l_1}, x_j \in Y_{l_2}\}$
where each unordered pair is counted once and:
$i \neq j;\ 1 \le i, j \le n$
$k_1 \neq k_2;\ 1 \le k, k_1, k_2 \le K$
$l_1 \neq l_2;\ 1 \le l, l_1, l_2 \le L$
The total number of pairs, i.e. the number of sample pairs that can be formed from the data set, is therefore:
$$a + b + c + d = \binom{n}{2} = \frac{n(n-1)}{2}$$
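The four pair counts can be tallied directly from two label lists. The following is a minimal sketch in plain Python; the function name `pair_counts` and the toy label lists are illustrative assumptions, not part of any library.

```python
from itertools import combinations

def pair_counts(labels_pred, labels_true):
    """Return (a, b, c, d) for a clustering and a reference labeling (hypothetical helper)."""
    a = b = c = d = 0
    # Visit every unordered pair of samples exactly once
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            a += 1   # SS: same cluster, same reference class
        elif same_pred:
            b += 1   # SD: same cluster, different reference class
        elif same_true:
            c += 1   # DS: different cluster, same reference class
        else:
            d += 1   # DD: different cluster, different reference class
    return a, b, c, d

pred = [0, 0, 1, 2]   # toy clustering result
true = [0, 0, 1, 1]   # toy reference labels
a, b, c, d = pair_counts(pred, true)
n = len(pred)
print(a, b, c, d, n * (n - 1) // 2)
```

The loop is quadratic in n, which is fine for illustration but slow for large data sets.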
2 Common external evaluation measures
2.1 Rand Index (RI) and Adjusted Rand Index (ARI)
Rand Index
$$RI = \frac{a + d}{\binom{n}{2}} = \frac{2(a + d)}{n(n-1)}$$
Clearly the value lies in $[0, 1]$, and larger is better. A value of 0 means the two clusterings disagree on every pair of samples; a value of 1 means the two clusterings are identical.
RI is not corrected for chance, however, which makes it unsuitable in some settings and motivated the Adjusted Rand Index.
From the Wikipedia article:
One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern, as does the chance-corrected adjusted Rand index.
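As a small illustration of the Rand Index definition, the sketch below counts the pairs on which two labelings agree (a + d) and divides by the number of pairs. The function name `rand_index` is an assumption made for this example, and the code assumes at least two samples.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Fraction of sample pairs on which the two labelings agree (hypothetical helper)."""
    pairs = list(combinations(range(len(labels_pred)), 2))
    # A pair agrees when it is "same/same" (a) or "different/different" (d)
    agree = sum(
        (labels_pred[i] == labels_pred[j]) == (labels_true[i] == labels_true[j])
        for i, j in pairs
    )
    return agree / len(pairs)

ri_partial = rand_index([0, 0, 1, 2], [0, 0, 1, 1])
ri_renamed = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
print(ri_partial, ri_renamed)
```

Renaming the cluster labels does not change the result, because only the "same cluster or not" relation of each pair is used.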
Adjusted Rand Index
ARI is the corrected-for-chance version of RI. Its values fall within $[-1, 1]$ rather than $[0, 1]$; they are negative when the index is below its expected value.
$$ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)}$$
The overlap between X and Y can be summarized in a contingency table $[n_{ij}]$, where $n_{ij} = |X_i \cap Y_j|$.
From the Wikipedia article:
The contingency table
Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between X and Y can be summarized in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j$: $n_{ij} = |X_i \cap Y_j|$.
$$\begin{array}{c|cccc|c}
X \backslash Y & Y_1 & Y_2 & \cdots & Y_s & \text{Sums} \\
\hline
X_1 & n_{11} & n_{12} & \cdots & n_{1s} & a_1 \\
X_2 & n_{21} & n_{22} & \cdots & n_{2s} & a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
X_r & n_{r1} & n_{r2} & \cdots & n_{rs} & a_r \\
\hline
\text{Sums} & b_1 & b_2 & \cdots & b_s &
\end{array}$$
Definition
The adjusted form of the Rand Index, the Adjusted Rand Index, is
$$\overbrace{ARI}^{\text{Adjusted Index}} = \frac{\overbrace{\sum_{ij}\binom{n_{ij}}{2}}^{\text{Index}} - \overbrace{\left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] / \binom{n}{2}}^{\text{Expected Index}}}{\underbrace{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right]}_{\text{Max Index}} - \underbrace{\left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] / \binom{n}{2}}_{\text{Expected Index}}}$$
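The contingency-table form of ARI translates almost term by term into code. The sketch below is an illustration under stated assumptions: the function name `adjusted_rand_index` is made up for this example, at least two samples are assumed, and returning 1.0 in the degenerate case where the denominator vanishes is a convention chosen here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_x, labels_y):
    """ARI computed from the contingency table of two labelings (hypothetical helper)."""
    n = len(labels_x)
    n_ij = Counter(zip(labels_x, labels_y))   # contingency table cells n_ij
    a = Counter(labels_x)                     # row sums a_i
    b = Counter(labels_y)                     # column sums b_j
    index = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)     # Expected Index
    max_index = (sum_a + sum_b) / 2           # Max Index
    if max_index == expected:
        return 1.0                            # assumed convention for degenerate partitions
    return (index - expected) / (max_index - expected)

ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_split, ari_crossed)
```

The second call shows a negative value: the two labelings agree on fewer pairs than would be expected by chance.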
2.2 Mutual Information (MI) and Adjusted Mutual Information (AMI)
Entropy:
$$H(X) = -\sum_{k=1}^{K} P(k) \log P(k), \quad \text{where } P(k) = \frac{|X_k|}{n}$$
$$H(Y) = -\sum_{l=1}^{L} P'(l) \log P'(l), \quad \text{where } P'(l) = \frac{|Y_l|}{n}$$
Mutual Information (MI):
$$MI(X, Y) = \sum_{k=1}^{K} \sum_{l=1}^{L} P(k, l) \log \frac{P(k, l)}{P(k) P'(l)}, \quad \text{where } P(k, l) = \frac{|X_k \cap Y_l|}{n}$$
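Entropy and mutual information can both be computed from label counts alone. The following sketch uses natural logarithms; the function names `entropy` and `mutual_information` are illustrative choices, not library functions.

```python
from collections import Counter
from math import log

def entropy(labels):
    """H of a labeling, using natural logarithms (hypothetical helper)."""
    n = len(labels)
    return -sum((count / n) * log(count / n) for count in Counter(labels).values())

def mutual_information(labels_x, labels_y):
    """MI(X, Y) from the joint and marginal label counts (hypothetical helper)."""
    n = len(labels_x)
    count_x = Counter(labels_x)
    count_y = Counter(labels_y)
    joint = Counter(zip(labels_x, labels_y))
    # P(k,l) / (P(k) P'(l)) simplifies to n * n_kl / (|X_k| |Y_l|)
    return sum(
        (n_kl / n) * log(n * n_kl / (count_x[k] * count_y[l]))
        for (k, l), n_kl in joint.items()
    )

x = [0, 0, 1, 1]
h_x = entropy(x)
mi_same = mutual_information(x, x)
mi_crossed = mutual_information(x, [0, 1, 0, 1])
print(h_x, mi_same, mi_crossed)
```

A labeling shares all of its information with itself, so MI(X, X) equals H(X), while the crossed labeling carries no information about X.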
Expected MI
The parameters $a$, $b$ and $n_{kl}$ used here are those of the contingency table in the Wikipedia excerpt quoted under ARI.
$$E\{MI(X, Y)\} = \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{n_{kl} = (a_k + b_l - n)^+}^{\min(a_k, b_l)} \frac{n_{kl}}{n} \log\left(\frac{n \cdot n_{kl}}{a_k b_l}\right) \times \frac{a_k!\, b_l!\, (n - a_k)!\, (n - b_l)!}{n!\, n_{kl}!\, (a_k - n_{kl})!\, (b_l - n_{kl})!\, (n - a_k - b_l + n_{kl})!}$$
The lower limit $(a_k + b_l - n)^+$ is clipped so that it is never negative; a term with $n_{kl} = 0$ contributes nothing to the sum.
Adjusted Mutual Information (AMI)
$$AMI(X, Y) = \frac{MI - E(MI)}{\max(H(X), H(Y)) - E(MI)}$$
AMI equals 1 when the two clusterings are identical and is close to 0 when they are independent; it can be negative when the agreement is below what chance alone would give.
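The expected-MI sum looks heavy but is a triple loop over the row sums, the column sums and the admissible cell values. The sketch below is one possible implementation under stated assumptions: all function names are made up for this example, factorials are handled through `math.lgamma` to avoid overflow, and the max-entropy normalization from the formula above is used (library implementations may default to a different normalization).

```python
from collections import Counter
from math import exp, lgamma, log

def log_factorial(m):
    return lgamma(m + 1)          # log(m!) without overflow

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(labels_x, labels_y):
    n = len(labels_x)
    cx, cy = Counter(labels_x), Counter(labels_y)
    return sum((v / n) * log(n * v / (cx[k] * cy[l]))
               for (k, l), v in Counter(zip(labels_x, labels_y)).items())

def expected_mutual_information(labels_x, labels_y):
    n = len(labels_x)
    emi = 0.0
    for a_k in Counter(labels_x).values():
        for b_l in Counter(labels_y).values():
            # n_kl = 0 contributes nothing, so the loop starts at 1
            for n_kl in range(max(1, a_k + b_l - n), min(a_k, b_l) + 1):
                log_prob = (log_factorial(a_k) + log_factorial(b_l)
                            + log_factorial(n - a_k) + log_factorial(n - b_l)
                            - log_factorial(n) - log_factorial(n_kl)
                            - log_factorial(a_k - n_kl) - log_factorial(b_l - n_kl)
                            - log_factorial(n - a_k - b_l + n_kl))
                emi += (n_kl / n) * log(n * n_kl / (a_k * b_l)) * exp(log_prob)
    return emi

def adjusted_mutual_information(labels_x, labels_y):
    mi = mutual_information(labels_x, labels_y)
    emi = expected_mutual_information(labels_x, labels_y)
    denom = max(entropy(labels_x), entropy(labels_y)) - emi
    if abs(denom) < 1e-15:
        return 1.0                # assumed convention for degenerate labelings
    return (mi - emi) / denom

ami_same = adjusted_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
ami_crossed = adjusted_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(ami_same, ami_crossed)
```

The crossed labeling yields a negative AMI, which shows that the value is not confined to $[0, 1]$.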
2.3 Homogeneity, Completeness and V-measure
Homogeneity: each cluster contains only samples of a single class
$$h = 1 - \frac{H(Y|X)}{H(Y)}$$
where $H(Y)$ is the entropy of the reference classes Y, and
$H(Y|X)$ is the conditional entropy of the classes Y given the cluster assignments X:
$$H(Y|X) = \sum_{k=1}^{K} \sum_{l=1}^{L} P(k, l) \log \frac{P(k)}{P(k, l)} = -\sum_{k=1}^{K} \sum_{l=1}^{L} \frac{n_{kl}}{n} \log \frac{n_{kl}}{a_k}, \quad \text{where } n_{kl} = |X_k \cap Y_l|,\ a_k = |X_k|$$
Completeness: all samples of the same class are assigned to the same cluster
$$c = 1 - \frac{H(X|Y)}{H(X)}$$
V-measure: the harmonic mean of homogeneity and completeness
$$v = \frac{2 \cdot h \times c}{h + c}$$
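The three scores follow from two conditional entropies. The sketch below is a plain-Python illustration; the function names are assumptions for this example, and treating a zero-entropy labeling as scoring 1.0 is a convention chosen here to avoid dividing by zero.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given), computed from joint and marginal counts (hypothetical helper)."""
    n = len(target)
    given_count = Counter(given)
    joint = Counter(zip(given, target))
    return -sum((n_kl / n) * log(n_kl / given_count[g])
                for (g, _), n_kl in joint.items())

def homogeneity_completeness_v(labels_true, labels_pred):
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    # Homogeneity: how pure each cluster is with respect to the classes
    h = 1.0 if h_true == 0 else 1.0 - conditional_entropy(labels_true, labels_pred) / h_true
    # Completeness: how concentrated each class is within one cluster
    c = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(labels_pred, labels_true) / h_pred
    v = 0.0 if h + c == 0 else 2 * h * c / (h + c)
    return h, c, v

h, c, v = homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 1, 2])
print(h, c, v)
```

In this toy case every cluster is pure, so homogeneity is 1, but one class is split across two clusters, so completeness drops below 1.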
2.4 Fowlkes-Mallows index (FMI)
FMI is the geometric mean of the pairwise precision and recall
$$FMI = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$$
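A short sketch of FMI built on the pair counts a, b and c. The function name `fowlkes_mallows` is an illustrative assumption, and returning 0.0 when no pair shares both a cluster and a class is a convention chosen for this example.

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_pred, labels_true):
    """Geometric mean of pairwise precision and recall (hypothetical helper)."""
    a = b = c = 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            a += 1
        elif same_pred:
            b += 1
        elif same_true:
            c += 1
    if a == 0:
        return 0.0                      # assumed convention: no agreeing "same" pair
    precision = a / (a + b)             # pairs grouped together that belong together
    recall = a / (a + c)                # pairs that belong together and were grouped
    return sqrt(precision * recall)

fmi_partial = fowlkes_mallows([0, 0, 1, 2], [0, 0, 1, 1])
fmi_renamed = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
print(fmi_partial, fmi_renamed)
```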
2.5 Other external evaluation measures
Jaccard Coefficient (JC)
Also known as the Jaccard Index.
$$J = \frac{a}{a + b + c}$$
Dice Index (DI)
$$DI = \frac{2a}{2a + b + c}$$
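Both coefficients need only a, b and c, which can also be read off the contingency table: a is the sum of $\binom{n_{ij}}{2}$ over the cells, a + b the same sum over the cluster sizes, and a + c the same sum over the class sizes. The sketch below uses that route; all function names are illustrative assumptions, and it assumes a + b + c is not zero.

```python
from collections import Counter
from math import comb

def pair_counts_from_contingency(labels_pred, labels_true):
    """Return (a, b, c) without enumerating pairs (hypothetical helper)."""
    a = sum(comb(v, 2) for v in Counter(zip(labels_pred, labels_true)).values())
    same_pred = sum(comb(v, 2) for v in Counter(labels_pred).values())   # a + b
    same_true = sum(comb(v, 2) for v in Counter(labels_true).values())   # a + c
    return a, same_pred - a, same_true - a

def jaccard(a, b, c):
    return a / (a + b + c)

def dice(a, b, c):
    return 2 * a / (2 * a + b + c)

a, b, c = pair_counts_from_contingency([0, 0, 1, 2], [0, 0, 1, 1])
j_value = jaccard(a, b, c)
d_value = dice(a, b, c)
print(a, b, c, j_value, d_value)
```

The two measures are monotonically related: the Dice value always equals 2J / (1 + J).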
3 Common internal evaluation measures
3.1 Silhouette coefficient
The silhouette coefficient applies when the true class labels are unknown. For a sample point i, let:
$a(i)$: the mean distance from i to all other points in its own cluster
$b(i)$: the smallest, over all other clusters, of the mean distance from i to the points of that cluster
The silhouette coefficient of sample i is then:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \quad \text{or} \quad s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)} & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1 & \text{if } a(i) > b(i) \end{cases}$$
So s(i) always satisfies:
$$-1 \le s(i) \le 1$$
When $a(i) \ll b(i)$, s(i) approaches 1, which means sample i is well clustered;
When $a(i) \gg b(i)$, s(i) approaches -1, which means sample i would fit better in the neighboring cluster;
When $a(i) \approx b(i)$, s(i) approaches 0, which means sample i lies on the border between two clusters.
The mean silhouette value is:
$$\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)$$
As a rule of thumb, $\bar{s} > 0.5$ indicates a reasonable clustering;
$\bar{s} < 0.2$ indicates that the data show no real cluster structure.
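To make the definitions of a(i), b(i) and s(i) concrete, here is a minimal plain-Python sketch using Euclidean distance. The function name `silhouette_values`, the toy points, and the choice of s(i) = 0 for singleton clusters are assumptions made for this example; it also assumes at least two clusters and requires Python 3.8+ for `math.dist`.

```python
from math import dist

def silhouette_values(points, labels):
    """Per-sample silhouette coefficients with Euclidean distance (hypothetical helper)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)          # assumed convention for singleton clusters
            continue
        # dist(p, p) is 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members
        a_i = sum(dist(p, q) for q in own) / (len(own) - 1)
        b_i = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        values.append((b_i - a_i) / max(a_i, b_i))
    return values

points = [(0, 0), (1, 0), (10, 0), (11, 0)]   # two tight, well separated groups
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
mean_s = sum(s) / len(s)
print(s, mean_s)
```

Because the two groups are compact and far apart, every s(i) is close to 1.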
3.2 Calinski-Harabasz (CH)
CH also applies when the true class labels are unknown. Taking K-means as an example, for a given number of clusters K:
The within-cluster dispersion is:
$$W(K) = \sum_{k=1}^{K} \sum_{C(j) = k} \|x_j - \bar{x}_k\|^2$$
The between-cluster dispersion is:
$$B(K) = \sum_{k=1}^{K} a_k \|\bar{x}_k - \bar{x}\|^2$$
where $C(j)$ is the cluster of sample j, $\bar{x}_k$ is the centroid of cluster k, $\bar{x}$ is the mean of all samples, and $a_k$ is the number of samples in cluster k. CH is then (with $N = n$ the total number of samples):
$$CH(K) = \frac{B(K)\,(N - K)}{W(K)\,(K - 1)}$$
A larger CH is better. CH also tends to be faster to compute than the silhouette coefficient, since it needs only distances to centroids rather than all pairwise distances.
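The CH ratio can be checked by hand on a tiny data set. The sketch below is a plain-Python illustration; the function names and toy points are assumptions for this example, and it assumes at least two clusters and a non-zero within-cluster dispersion.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of points (hypothetical helper)."""
    return [sum(coord) / len(points) for coord in zip(*points)]

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def calinski_harabasz(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    overall = mean_point(points)
    w = 0.0   # within-cluster dispersion W(K)
    b = 0.0   # between-cluster dispersion B(K)
    for members in clusters.values():
        centroid = mean_point(members)
        w += sum(squared_distance(p, centroid) for p in members)
        b += len(members) * squared_distance(centroid, overall)
    return b * (n - k) / (w * (k - 1))

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
labels = [0, 0, 1, 1]
ch_value = calinski_harabasz(points, labels)
print(ch_value)
```

Here W = 1 and B = 100, so the score is 100 * 2 / (1 * 1) = 200; only n distances to centroids are needed.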
3.3 Other internal evaluation measures
Davies-Bouldin Index (DBI)
Let:
$\sigma_i$: the mean distance from the points of cluster i to the center of that cluster;
$c_i$: the center of cluster i;
$d(c_i, c_j)$: the distance between the centers of clusters i and j.
Then, summing over the K clusters:
$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$$
The smaller the DBI, the better.
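A minimal sketch of DBI with Euclidean distance and centroids as cluster centers. The function name `davies_bouldin` and the toy data are assumptions for this example; it assumes at least two clusters with distinct centers and requires Python 3.8+ for `math.dist`.

```python
from math import dist

def davies_bouldin(points, labels):
    """Davies-Bouldin index with centroid centers (hypothetical helper)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centers, sigmas = [], []
    for members in clusters.values():
        center = [sum(coord) / len(members) for coord in zip(*members)]
        centers.append(center)
        # sigma: mean distance from the members of the cluster to its center
        sigmas.append(sum(dist(p, center) for p in members) / len(members))
    k = len(centers)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio between cluster i and any other cluster
        total += max(
            (sigmas[i] + sigmas[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
db_good = davies_bouldin(points, [0, 0, 1, 1])
db_bad = davies_bouldin(points, [0, 1, 0, 1])
print(db_good, db_bad)
```

The natural grouping scores 0.1, while the interleaved grouping scores far higher, matching the "smaller is better" reading.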
Dunn Index (DI)
Let:
$d(i, j)$: the distance between clusters i and j;
$d'(k)$: the largest distance between any two samples within cluster k
Then, over the K clusters:
$$D = \frac{\min_{1 \le i < j \le K} d(i, j)}{\max_{1 \le k \le K} d'(k)}$$
The larger the DI, the better.
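The Dunn index depends on how the distance between two clusters is defined. The sketch below assumes the smallest distance between a point of one cluster and a point of the other (a single-linkage style choice made for this example, not the only option) and Euclidean distance. The function name `dunn_index` is illustrative; the code assumes at least two clusters, at least one cluster with two distinct points, and Python 3.8+.

```python
from itertools import combinations
from math import dist

def dunn_index(points, labels):
    """Dunn index with closest-pair inter-cluster distance (hypothetical helper)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    # d(i, j): smallest distance between a point of cluster i and a point of cluster j
    min_between = min(
        dist(p, q)
        for members_i, members_j in combinations(clusters.values(), 2)
        for p in members_i
        for q in members_j
    )
    # d'(k): largest distance between two points of cluster k (0 for a singleton)
    max_within = max(
        max((dist(p, q) for p, q in combinations(members, 2)), default=0.0)
        for members in clusters.values()
    )
    return min_between / max_within

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
dunn_good = dunn_index(points, [0, 0, 1, 1])
dunn_bad = dunn_index(points, [0, 1, 0, 1])
print(dunn_good, dunn_bad)
```

For the natural grouping the closest points of the two clusters are 9 apart and the widest cluster has diameter 1, giving a Dunn index of 9.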
4 Evaluation functions in sklearn
4.1 How to import
You can import all the evaluation functions at once:
from sklearn import metrics
You can also import only the evaluation functions you need:
from sklearn.metrics import homogeneity_completeness_v_measure
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.metrics import homogeneity_score
from sklearn.metrics import completeness_score
from sklearn.metrics import v_measure_score
from sklearn.metrics import fowlkes_mallows_score
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics import silhouette_samples
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score  # spelled calinski_harabaz_score in old releases
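4.2 How to use
External evaluation functions take at least two arguments (labels_true, labels_pred): the ground-truth labels and the predicted labels.
from sklearn.metrics import adjusted_rand_score
# Identical partitions, and partitions identical up to a renaming of the labels
ARI_1 = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
ARI_2 = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print("(ARI_1, ARI_2):", (ARI_1, ARI_2))
# One true class split across two predicted clusters
ARI_3 = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
print("ARI_3:", ARI_3)
# A single true class scattered into singleton clusters
ARI_4 = adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])
print("ARI_4:", ARI_4)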
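Internal evaluation functions take at least two arguments (X, labels): the data and the cluster labels assigned to it.
from sklearn.datasets import load_digits
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score
# Load the digits data; internal measures do not use the true labels in target
train, target = load_digits(return_X_y=True)
# Cluster with the default settings of MiniBatchKMeans
mb_kmeans = MiniBatchKMeans()
mb_kmeans.fit(train)
# Predict once and reuse the labels for both scores
labels = mb_kmeans.predict(train)
ch = calinski_harabasz_score(train, labels)
ms = silhouette_score(train, labels)
print("CH score:", ch)
print("Mean Silhouette score:", ms)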