Credit risk predictive modeling - try 1

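Goal: credit scoring, i.e. assessing the risk of personal loans.

I. Data preprocessing

Importing the data

Independent variables (continuous)	V2,V5,V8,V11,V13,V16,V18
Independent variables (categorical)	V1,V3,V4,V6,V7,V9,V10,V12,V14,V15,V17,V19,V20
Dependent variable y	V21
Variable definitions	https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

* The data can be downloaded from the link given under "Variable definitions".

R code: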

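# stringsAsFactors = TRUE keeps the categorical columns as factors (the default became FALSE in R 4.0.0)
rawdata <- read.table("D:/personal/knowledge/dataMining/dataset/german/german.data", header = FALSE, stringsAsFactors = TRUE)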

rawdata$y <- as.factor(rawdata$V21)

rawdata$V21 <- NULL

str(rawdata)

Data preparation

Training data	600 records sampled from the full sample
Validation data	the remaining 400 records

R code:

trainIdx <- sample(nrow(rawdata), round(0.6*nrow(rawdata)))

traindata <- rawdata[trainIdx,]

validdata <- rawdata[-trainIdx,]
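
Because sample() is random, every run produces a different 600/400 split, and the results further down change with it. A minimal sketch of a repeatable split, assuming a hypothetical toy data frame in place of rawdata and an arbitrary seed value:

```r
# Toy stand-in for rawdata: 1000 rows with a binary outcome (hypothetical data).
set.seed(42)                                   # arbitrary seed, fixes the random split
toy <- data.frame(x = rnorm(1000),
                  y = factor(sample(1:2, 1000, replace = TRUE)))

trainIdx  <- sample(nrow(toy), round(0.6 * nrow(toy)))
traindata <- toy[trainIdx, ]
validdata <- toy[-trainIdx, ]

nrow(traindata)   # 600
nrow(validdata)   # 400
```

Calling set.seed() once before sample() is enough; the seed value itself carries no meaning.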

1. Data cleaning

(1) Missing-value handling (missing data processing)

There are no missing values.
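
A quick way to check such a claim is to count the NAs per column. This is a sketch on a hypothetical toy frame with one deliberately missing value; on the real data one would pass rawdata instead:

```r
# Hypothetical toy data frame with one missing value, standing in for rawdata.
toy <- data.frame(V2 = c(6, 48, NA, 42),
                  V1 = factor(c("A11", "A12", "A14", "A11")))

na_per_column <- colSums(is.na(toy))   # number of NAs in each column
na_per_column
anyNA(toy)                             # TRUE if any value is missing
```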

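(2) Discretization of continuous data (data discretization)

Handled with WoE; see the modeling stage.

(3) Noise removal (noisy data processing)

(Not yet investigated for lack of time.)

(4) Outlier removal (outlier processing)

?

(5) Handling collinear variables (pairwise correlations processing)

VIF (not yet investigated for lack of time)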
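
For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The helper below, vif_manual, is a hypothetical name and the data are made up; it is only a base-R sketch of the idea, not something that was run on the German credit data:

```r
# Variance inflation factor by hand: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the other predictors.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# Hypothetical toy predictors: x2 is nearly a copy of x1, x3 is independent.
set.seed(1)
x1  <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))

vif_toy <- vif_manual(toy)
round(vif_toy, 1)   # x1 and x2 come out large, x3 close to 1
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity.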

2. Data integration

There is a single data source with a consistent structure, so no integration is needed.

3. Data transformation

(1) Normalization

Handled with WoE; see the modeling stage.

II. Model selection

1. GLM logistic regression

(1) WoE modeling

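We use the WoE (Weight of Evidence) transform known from credit scorecards. Note that woe() in klaR only transforms factor (categorical) variables; the continuous variables are not discretized by this step and enter the regression unchanged, as the coefficient tables in this post show.

R code: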

library(klaR)  # requires the klaR package: install.packages("klaR")

woemodel <- woe(y~., data = traindata, zeroadj=0.5, appont = TRUE)
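
Since woe() works on factor columns, a continuous variable would have to be binned into a factor first if it is to get a WoE coding as well. A minimal sketch using quartile bins; the variable amount is made-up data, and quartiles are only one possible binning choice:

```r
# Hypothetical continuous variable standing in for e.g. a credit amount.
set.seed(7)
amount <- round(rexp(500, rate = 1 / 3000))

# Quartile-based bins; include.lowest keeps the minimum inside the first bin.
breaks     <- unique(quantile(amount, probs = seq(0, 1, 0.25)))
amount_bin <- cut(amount, breaks = breaks, include.lowest = TRUE)

table(amount_bin)      # roughly equal-sized bins
is.factor(amount_bin)  # TRUE, so it would now be treated as categorical
```

The binned column would replace the raw one in the data frame before woe() is called.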

(2) IV check

IV (Information Value) check, judged against the following criteria:

Information Value	Predictive Power
< 0.02	useless for prediction
0.02 to 0.1	Weak predictor
0.1 to 0.3	Medium predictor
0.3 to 0.5	Strong predictor
>0.5	too good to be true
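
As a rough illustration of where these numbers come from, WoE and IV can be computed by hand for a single categorical variable. The data below are invented, no zero-cell adjustment (like zeroadj) is applied, and the sign convention (good over bad) is an assumption that may differ from a given package:

```r
# Hypothetical toy data: one categorical predictor and a good/bad outcome.
x <- factor(c(rep("A", 50), rep("B", 30), rep("C", 20)))
y <- factor(c(rep("good", 40), rep("bad", 10),    # A: 40 good, 10 bad
              rep("good", 15), rep("bad", 15),    # B: 15 good, 15 bad
              rep("good",  5), rep("bad", 15)))   # C:  5 good, 15 bad

tab       <- table(x, y)
dist_good <- tab[, "good"] / sum(tab[, "good"])   # share of all goods per level
dist_bad  <- tab[, "bad"]  / sum(tab[, "bad"])    # share of all bads per level

woe_x <- log(dist_good / dist_bad)                # weight of evidence per level
iv_x  <- sum((dist_good - dist_bad) * woe_x)      # information value of x

round(woe_x, 3)
round(iv_x, 3)
```

Each term of the IV sum is non-negative, so the IV grows with every level whose good and bad shares differ.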

R code:

woemodel

Result:

IV

V1 0.672439277

V3 0.284116679

V6 0.223761533

V4 0.149480263

V7 0.119616049

V10 0.092531065

V12 0.085246908

V15 0.070580379

V20 0.061258525

V14 0.054065776

V9 0.041359709

V17 0.008511956

V19 0.001861789

From the results: IV < 0.02 for V17 and V19, and IV > 0.5 for V1.

V1: Status of existing checking account

V17: Job

V19: Telephone

Hence V1, V17 and V19 should not be put into the model directly. (Is that really all it takes?)

(3) Logistic modeling

Logistic regression with Weight of Evidence.

R code:

woedata <- predict(woemodel, traindata, replace = TRUE)

woedata$woe.V1 <- NULL

woedata$woe.V17 <- NULL

woedata$woe.V19 <- NULL

str(woedata)

logit.glm <- glm(y~., family=binomial, data=woedata)

(4) z statistics and AIC check

R code:

summary(logit.glm)

Result:

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.328e+00 6.965e-01 -3.342 0.000832 ***

V2 2.890e-02 1.114e-02 2.594 0.009487 **

V5 1.055e-04 4.838e-05 2.180 0.029264 *

V8 3.001e-01 1.035e-01 2.898 0.003756 **

V11 1.164e-01 1.023e-01 1.138 0.255123

V13 -3.320e-02 1.105e-02 -3.005 0.002654 **

V16 1.193e-02 2.028e-01 0.059 0.953095

V18 3.748e-01 3.042e-01 1.232 0.218022

woe.V3 -1.068e+00 2.168e-01 -4.926 8.41e-07 ***

woe.V4 -1.233e+00 2.803e-01 -4.399 1.09e-05 ***

woe.V6 -1.022e+00 2.362e-01 -4.326 1.52e-05 ***

woe.V7 -6.759e-01 3.190e-01 -2.118 0.034140 *

woe.V9 -1.472e+00 5.540e-01 -2.658 0.007862 **

woe.V10 -9.178e-01 3.602e-01 -2.548 0.010827 *

woe.V12 9.430e-02 4.012e-01 0.235 0.814189

woe.V14 -8.667e-01 4.421e-01 -1.960 0.049953 *

woe.V15 -5.409e-01 4.103e-01 -1.318 0.187396

woe.V20 -1.480e+00 7.809e-01 -1.895 0.058054 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 747.65 on 599 degrees of freedom

Residual deviance: 581.37 on 582 degrees of freedom

AIC: 617.37

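From the results, woe.V20 (p = 0.058) is significant only at the 0.1 level, not at 0.05, while V11, V16, V18, woe.V12 and woe.V15 all have p-values above 0.1. For these variables the null hypothesis of a zero coefficient cannot be rejected, so no significant effect on the dependent variable (credit risk) is detected.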

V11: Present residence since

V12: Property

V15: Housing

V16: Number of existing credits at this bank

V18: Number of people being liable to provide maintenance for

V20: Foreign worker

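(Seriously, neither property nor housing has any effect?!)

The AIC is 617.37; it is used later for stepwise regression and model comparison.

(5) Stepwise regression modeling

To deal with the non-significant coefficients we apply stepwise logistic regression.

R code:

logit.glm.step <- step(logit.glm, direction="both")

(6) z statistics and AIC check

R code:

summary(logit.glm.step)

Result: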

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.612e+00 5.013e-01 -3.215 0.00130 **

V2 2.845e-02 1.090e-02 2.611 0.00904 **

V5 1.009e-04 4.766e-05 2.117 0.03425 *

V8 2.880e-01 1.022e-01 2.817 0.00484 **

V13 -2.969e-02 1.073e-02 -2.768 0.00564 **

woe.V3 -1.048e+00 2.023e-01 -5.183 2.19e-07 ***

woe.V4 -1.261e+00 2.786e-01 -4.527 5.99e-06 ***

woe.V6 -9.894e-01 2.338e-01 -4.231 2.33e-05 ***

woe.V7 -5.970e-01 3.113e-01 -1.918 0.05514 .

woe.V9 -1.264e+00 5.276e-01 -2.396 0.01657 *

woe.V10 -8.695e-01 3.502e-01 -2.483 0.01304 *

woe.V14 -8.312e-01 4.385e-01 -1.896 0.05801 .

woe.V15 -6.759e-01 3.853e-01 -1.754 0.07940 .

woe.V20 -1.491e+00 7.786e-01 -1.915 0.05550 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 747.65 on 599 degrees of freedom

Residual deviance: 584.34 on 586 degrees of freedom

AIC: 612.34

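After stepwise regression, V11, V12, V16 and V18 are dropped, while V15 and V20 are kept. All remaining coefficients are significant at the 0.1 level, although woe.V7, woe.V14, woe.V15 and woe.V20 do not reach the 0.05 level (step() selects by AIC, not by p-value). The AIC is 612.34, lower than the original 617.37, so the stepwise model is preferred.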
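
The two AIC values can be reproduced from the printed output. For a logistic model with a binary response, the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The coefficient counts are read off the two tables above:

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_full <- 581.37 + 2 * 18   # full model: intercept + 17 predictors
aic_step <- 584.34 + 2 * 14   # stepwise model: intercept + 13 predictors

aic_full              # 617.37
aic_step              # 612.34
aic_full - aic_step   # 5.03
```

The stepwise model fits slightly worse (higher deviance) but uses four fewer parameters, which is what the AIC rewards.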

(7) Other checks

ROC/AUC and Gini checks (to be added later)
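
Until that is filled in, here is a base-R sketch of how AUC and Gini could be computed without extra packages. It uses the rank-sum (Mann-Whitney) identity for AUC and Gini = 2 * AUC - 1. The function name auc_rank and the scores are hypothetical:

```r
# AUC via the rank-sum (Mann-Whitney) identity; Gini = 2 * AUC - 1.
auc_rank <- function(score, is_event) {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(is_event)
  n_neg <- sum(!is_event)
  (sum(r[is_event]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and outcomes (TRUE = event, e.g. a bad loan).
score    <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
is_event <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc  <- auc_rank(score, is_event)
gini <- 2 * auc - 1
c(AUC = auc, Gini = gini)   # AUC = 0.8125, Gini = 0.625
```

On the real data the inputs would be the predicted probabilities on the validation set and a logical vector marking the actual events.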

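2. GAM logistic regression

(to be added later)

3. Model comparison

(to be added later)

4. Building scorecards

(How is this done??? My biggest open question.)

Two formulas:

(1) Score for each attribute

woe = ln(odds); beta is the regression coefficient, alpha the intercept, n the number of variables, offset the offset (chosen according to risk appetite), and factor the scaling factor.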
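
The formula itself is not reproduced in the text. A commonly quoted form that uses exactly these symbols is given below as an assumption, not as the original formula. The minus sign assumes the regression models the log-odds of a bad outcome and that a higher score should mean lower risk; it also assumes every variable in the model is WoE-coded, which is not the case for the continuous variables in the model fitted in this post.

```latex
\text{score}_i = -\left(\beta_i \cdot \text{woe}_i + \frac{\alpha}{n}\right)\cdot \text{factor} + \frac{\text{offset}}{n},
\qquad i = 1, \dots, n
```

Here woe_i is the WoE value of the attribute that the applicant falls into for variable i.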

(2) Total score

Are the scaling factor and the offset set by hand, or back-calculated?
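
In common scorecard practice it is a bit of both: one picks a base score, the odds at that base score, and the "points to double the odds" (PDO) by hand, and factor and offset then follow from score = offset + factor * ln(odds). The sketch below uses made-up choices (600 points at 50:1 odds, PDO of 20) and the hypothetical names factor_ and score_from_odds:

```r
# Hypothetical scaling choices: 600 points at good:bad odds of 50:1,
# and 20 points to double the odds (PDO).
base_score <- 600
base_odds  <- 50
pdo        <- 20

factor_ <- pdo / log(2)                           # points per unit of ln(odds)
offset  <- base_score - factor_ * log(base_odds)

# Total score as a function of good:bad odds.
score_from_odds <- function(odds) offset + factor_ * log(odds)

score_from_odds(50)    # 600
score_from_odds(100)   # 620
```

The total score is the sum of the attribute scores; with the model's predicted log-odds plugged in, it reduces to this same offset + factor * ln(odds) line.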

5. Model validation

R code:

validWoeData <- predict(woemodel, validdata, replace = TRUE)

pred.val <- predict(logit.glm.step, validWoeData, type = "response")

pred.val

Result (first 16 records):

5 7 12 13 14 15 17 23
0.357798810 0.075791812 0.837024202 0.225547085 0.095280357 0.528561890 0.025320823 0.006696470
25 31 32 40 42 44 46 47
0.002720358 0.161846210 0.512595515 0.247351390 0.179962491 0.146126291 0.303658983 0.134170936

(How should this result be read?)
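
One way to read it: the integer above each number is the row name of a validation record, and the number is the predicted probability of the second level of y. For a binomial glm with a factor response, R treats the first level as "failure" and models the rest, so here that is the probability of y = 2, which the UCI description codes as a bad credit. To turn these into decisions, one picks a cutoff and compares against the true labels. The values below are made up, and the 0.5 cutoff is an assumption rather than a recommendation:

```r
# Hypothetical predicted probabilities of y = 2 and matching true labels;
# on the real data these would be pred.val and validdata$y.
pred.toy <- c(0.36, 0.08, 0.84, 0.23, 0.10, 0.53, 0.03, 0.01)
y.toy    <- factor(c(1, 1, 2, 1, 1, 2, 2, 1), levels = c(1, 2))

cutoff     <- 0.5                                         # assumed threshold
pred.class <- factor(ifelse(pred.toy > cutoff, 2, 1), levels = c(1, 2))

conf.mat <- table(actual = y.toy, predicted = pred.class)
conf.mat
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
accuracy   # 0.875
```

In credit scoring the two kinds of error usually have different costs, so the cutoff is often chosen from the confusion matrix rather than fixed at 0.5.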

III. Model prediction

Records drawn from the model validation step are used as the predictions.
