The dataset comes from a research collaboration between Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) in Brussels, Belgium. It contains transactions made by European cardholders over two days in September 2013: of 284,807 transactions, 492 are frauds, so the positive (fraud) class accounts for only 0.172% of all transactions, making the dataset highly imbalanced. All input variables are numeric and are the result of a PCA transformation: for confidentiality reasons, features V1, V2, ..., V28 are PCA principal components, and only "Time" and "Amount" are original features.
The dataset can be explored from several angles:
identifying fraudulent credit card transactions;
handling imbalanced samples: trying different resampling strategies and seeing how they affect model performance;
modeling: logistic regression, SVM, decision trees, XGBoost, etc. can all be tried for prediction;
model evaluation criteria.
Given the severity of the class imbalance, the Area Under the Precision-Recall Curve (AUPRC) is the recommended metric. Note that for imbalanced classification, reading plain accuracy off a confusion matrix is not recommended because it is meaningless here: a model that labels every transaction as legitimate already reaches about 99.8% accuracy. A confusion matrix becomes informative again once the samples have been balanced.
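As a concrete illustration of the recommended metric, the sketch below computes AUPRC with scikit-learn's `average_precision_score` on toy labels and scores (the values are made up for illustration, not taken from this dataset):

```python
# Sketch: computing AUPRC (average precision) with scikit-learn.
# y_true / y_score are toy values for illustration only.
from sklearn.metrics import average_precision_score

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                        # imbalanced labels
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.9, 0.35]  # model scores

auprc = average_precision_score(y_true, y_score)
print(f"AUPRC = {auprc:.3f}")
```

Unlike accuracy, this score is driven entirely by how well the positive class is ranked, which is why it stays informative at a 0.172% positive rate.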
## In the correlation matrix of the imbalanced data there is almost no visible relationship with Class,
## but on the balanced sample V2, V4, V11, V19 show a clear positive correlation, and
## V1, V3, V7, V10, V12, V14, V16, V17 a negative one.
## Plot a few of the positive ones first; V19 is the weakest.
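The plots below use `new_df`, which is not defined in this excerpt; presumably it is the balanced sample mentioned in the comment above. A minimal sketch of how such a frame could be built by random undersampling, assuming the full data lives in `df` with a binary `Class` column (a small synthetic `df` stands in here):

```python
# Sketch: building a balanced frame (called new_df, matching the plots below)
# by random undersampling of the majority class.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Toy stand-in for the full df: 1000 rows, roughly 1% "fraud".
df = pd.DataFrame({
    "V2": rng.normal(size=1000),
    "Class": (rng.random(1000) < 0.01).astype(int),
})

fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
new_df = pd.concat([fraud, normal]).sample(frac=1, random_state=42)  # shuffle
print(new_df["Class"].value_counts())
```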
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplots of the positively correlated features against Class
f, axes = plt.subplots(ncols=4, figsize=(20, 4))
sns.boxplot(x='Class', y='V2', data=new_df, ax=axes[0])
axes[0].set_title('Positive correlation between V2 and Class')
sns.boxplot(x='Class', y='V4', data=new_df, ax=axes[1])
axes[1].set_title('Positive correlation between V4 and Class')
sns.boxplot(x='Class', y='V11', data=new_df, ax=axes[2])
axes[2].set_title('Positive correlation between V11 and Class')
sns.boxplot(x='Class', y='V19', data=new_df, ax=axes[3])
axes[3].set_title('Positive correlation between V19 and Class')
plt.show()
# Boxplots of the negatively correlated features against Class
f, axes = plt.subplots(ncols=4, figsize=(20, 4))
sns.boxplot(x='Class', y='V10', data=new_df, ax=axes[0])
axes[0].set_title('Negative correlation between V10 and Class')
sns.boxplot(x='Class', y='V12', data=new_df, ax=axes[1])
axes[1].set_title('Negative correlation between V12 and Class')
sns.boxplot(x='Class', y='V14', data=new_df, ax=axes[2])
axes[2].set_title('Negative correlation between V14 and Class')
sns.boxplot(x='Class', y='V17', data=new_df, ax=axes[3])
axes[3].set_title('Negative correlation between V17 and Class')
plt.show()
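The positive/negative feature lists in the comment above come from correlating each V-feature with `Class` on the balanced sample. A sketch of that computation, with synthetic data standing in for `new_df` (one clearly positive, one clearly negative, and one weak feature, mimicking V2, V10, and V19):

```python
# Sketch: ranking features by their correlation with 'Class' on the balanced
# sample -- the computation behind the positive/negative feature lists above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)                    # balanced 0/1 labels
new_df = pd.DataFrame({
    "V2":  y * 1.5 + rng.normal(size=200),    # positively correlated feature
    "V10": -y * 1.5 + rng.normal(size=200),   # negatively correlated feature
    "V19": rng.normal(size=200),              # weakly correlated feature
    "Class": y,
})

corr = new_df.corr()["Class"].drop("Class").sort_values()
print(corr)  # most negative first, most positive last
```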
weight is 1 for fraud class --
-------- Classification Report --------
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     99511
           1       0.82      0.56      0.66       172
    accuracy                           1.00     99683
   macro avg       0.91      0.78      0.83     99683
weighted avg       1.00      1.00      1.00     99683

weight is 5 for fraud class --
-------- Classification Report --------
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     99511
           1       0.81      0.83      0.82       172
    accuracy                           1.00     99683
   macro avg       0.91      0.92      0.91     99683
weighted avg       1.00      1.00      1.00     99683

weight is 10 for fraud class --
-------- Classification Report --------
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     99511
           1       0.74      0.85      0.79       172
    accuracy                           1.00     99683
   macro avg       0.87      0.93      0.90     99683
weighted avg       1.00      1.00      1.00     99683

weight is 50 for fraud class --
-------- Classification Report --------
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     99511
           1       0.45      0.88      0.60       172
    accuracy                           1.00     99683
   macro avg       0.73      0.94      0.80     99683
weighted avg       1.00      1.00      1.00     99683

weight is 100 for fraud class --
-------- Classification Report --------
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     99511
           1       0.24      0.90      0.38       172
    accuracy                           0.99     99683
   macro avg       0.62      0.95      0.69     99683
weighted avg       1.00      0.99      1.00     99683

weight is 500 for fraud class --
-------- Classification Report --------
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     99511
           1       0.07      0.96      0.12       172
    accuracy                           0.98     99683
   macro avg       0.53      0.97      0.56     99683
weighted avg       1.00      0.98      0.99     99683
fraud-class weight    PR score (average precision)
1                     0.5836083385125052
5                     0.7824530246938387
10                    0.7866313311565967
50                    0.8013701456084669
100                   0.7971764777129003
500                   0.8026288380720867
10000                 0.7385516882490497
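The results above sweep the weight of the fraud class: as the weight grows, class-1 precision collapses (0.82 down to 0.07) while recall climbs (0.56 up to 0.96), and the PR score plateaus around 0.78-0.80 from weight 5 onward. A self-contained sketch of the same experiment on synthetic data, using logistic regression (one of the models named earlier; the exact model and features used for the reports above are not shown in this excerpt):

```python
# Sketch: sweeping the fraud-class weight in logistic regression and scoring
# with average precision, mirroring the experiment above on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# ~0.5% positive class, loosely imitating the fraud rate
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.995], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for w in [1, 5, 10, 50, 100, 500]:
    clf = LogisticRegression(class_weight={0: 1, 1: w}, max_iter=1000)
    clf.fit(X_tr, y_tr)
    scores[w] = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"weight {w}: AP = {scores[w]:.3f}")
```

Note that `class_weight` mostly shifts the decision threshold (trading precision for recall); the ranking behind the PR score changes much less, which matches the plateau seen above.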