Kaggle (2): The Maximum-Profit Problem

Anonymous technical user · 2020-12-27 05:40

This is a supervised-learning problem about maximizing profit. A bank makes loans to many people, and the goal is to predict whether each borrower will repay. If a borrower repays, the label is 1 and the bank earns a profit; if not, the label is 0 and the bank does not. After the model makes its predictions, we compare them against the true labels to evaluate how good the model is.

Compared with Kaggle (1), this problem involves (1) more features and more samples, (2) more data-cleaning work, and (3) the use of model-evaluation metrics.
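The profit framing can be made concrete with a tiny sketch; every number below is made up for illustration:

```python
# Toy profit model (all numbers are hypothetical): the bank earns interest on
# loans that are repaid and loses principal on funded loans that default.
interest_per_good_loan = 500    # assumed average interest earned per repaid loan
loss_per_bad_loan = 2500        # assumed average principal lost per defaulted loan

repaid, defaulted = 900, 100    # hypothetical counts among the funded loans
profit = repaid * interest_per_good_loan - defaulted * loss_per_bad_loan
print(profit)                   # 200000
```

Because a single default can wipe out the interest from several good loans, a model that simply predicts "will repay" for everyone can have high accuracy yet poor profit, which is why the evaluation metrics later in this post matter.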

# coding: utf-8
import pandas as pd

load_2007 = pd.read_csv("LoanStats3a.csv", skiprows=1)   # read the file; the first line is a report header, so skip it
len(load_2007)   # number of rows

load_2007.shape   # 2-D shape of the frame: (rows, columns)
print(load_2007.shape[0])   # rows
print(load_2007.shape[1])   # columns
# Drop missing values based on the missing rate (share of NA relative to the total).
# axis = 0 drops rows, e.g. a sample that is missing half of its feature values.
# axis = 1 drops columns, e.g. a feature that is missing values for half of the samples.

# Set the threshold
half_count_column = int(load_2007.shape[0] / 2)   # half of the number of samples
half_count_row = int(load_2007.shape[1] / 2)      # half of the number of features
print(half_count_column)

# Keep a column only if at least half of its values are non-NA; axis=1 works column-wise.
load_2007 = load_2007.dropna(thresh=half_count_column, axis=1)
load_2007.shape
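The `thresh` argument is easy to misread; a minimal sketch on a toy frame (hypothetical data) shows that `dropna(thresh=k, axis=1)` keeps a column only if it has at least `k` non-NA values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1, 2, 3, 4],                  # complete column
    "b": [np.nan, np.nan, np.nan, 4],   # only one non-NA value
    "c": [1, np.nan, 3, 4],             # three non-NA values
})

half = int(df.shape[0] / 2)        # 2
kept = df.dropna(thresh=half, axis=1)
print(list(kept.columns))          # ['a', 'c'] -- "b" falls below the threshold
```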

load_2007 = load_2007.drop(['desc', 'url'], axis=1)   # axis = 1 drops columns, axis = 0 drops rows
load_2007.to_csv('load_2007.csv', index=False)
print(load_2007.iloc[1, :])   # indexing: loc[row_name, column_name] slices by label; iloc[1:2, 1:2] slices by integer position
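A quick toy example (hypothetical data) of the `loc` / `iloc` difference: `loc` slices by label and includes the end point, while `iloc` slices by integer position and excludes it:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["r1", "r2", "r3"])

by_label = df.loc["r1":"r2", "x"]   # label-based, end-inclusive: rows r1 and r2
by_pos = df.iloc[0:2, 0]            # position-based, end-exclusive: rows 0 and 1
print(by_label.tolist())            # [10, 20]
print(by_pos.tolist())              # [10, 20]
```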

load_2007.columns.values   # column names, returned as an array of strings
# Drop columns unrelated to the loan decision: drop("column_name", axis=1)

load_2007 = load_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

load_2007 = load_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

load_2007 = load_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)

load_2007.shape[1]
# pandas counting: value_counts tallies each distinct value in the column
print(load_2007['loan_status'].value_counts())
# pandas filtering: | means "or"; keep the rows whose boolean condition is True
load_2007 = load_2007[(load_2007['loan_status'] == "Fully Paid") | (load_2007['loan_status'] == "Charged Off")]
load_2007.shape
# pandas recoding: Series.map can take a dict and replace values by key
loans_status = {"Fully Paid": 1, "Charged Off": 0}
load_2007["loan_status"] = load_2007["loan_status"].map(loans_status)
load_2007.head(6)
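One thing to keep in mind with `Series.map` and a dict: any value missing from the dict silently becomes NaN. A small sketch on made-up values:

```python
import pandas as pd

s = pd.Series(["Fully Paid", "Charged Off", "Current"])
mapped = s.map({"Fully Paid": 1, "Charged Off": 0})
print(mapped.tolist()[:2])   # [1.0, 0.0]
print(pd.isna(mapped[2]))    # True -- "Current" is not a key in the dict
```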
# pandas filtering 2: if a column holds only a single distinct value, drop it
# load_2007.columns returns an Index; in practice it behaves like df.columns.values
column_name = load_2007.columns
drop_column = []
for col in column_name:
    col_series = load_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_column.append(col)
        load_2007 = load_2007.drop([col], axis=1)
print(drop_column)
# orig_columns = load_2007.columns
# drop_columns = []
# for col in orig_columns:
#     col_series = load_2007[col].dropna().unique()
#     if len(col_series) == 1:
#         drop_columns.append(col)
# load_2007 = load_2007.drop(drop_columns, axis=1)
# print(drop_columns)
# print (load_2007.shape)    
# Inspect missing values
load_2007.isnull().sum()   # NA count per column

loads = load_2007.drop(["pub_rec_bankruptcies"], axis=1)   # drop the column that still has many NAs
loads = loads.dropna(axis=0)                               # drop the rows that still contain NAs
print(loads.dtypes.value_counts())   # how many object / float / int columns remain
# Select columns by dtype
object_columns_df = loads.select_dtypes(include=["object"])   # note the list brackets
print(object_columns_df.iloc[1])


# Inspect the categories inside each object column;
# all string columns need to be converted to numeric values.
col_object = ['home_ownership', 'verification_status', 'purpose', 'addr_state']
for i in col_object:
    print(object_columns_df[i].value_counts())


print(object_columns_df["purpose"].value_counts())
object_columns_df.head(3)
# Recode with a dict via map; replace would also work, but a dict keeps the logic easy to follow
mapping_emp_length = {
    "10+ years": 10,
    "9 years": 9,
    "8 years": 8,
    "7 years": 7,
    "6 years": 6,
    "5 years": 5,
    "4 years": 4,
    "3 years": 3,
    "2 years": 2,
    "1 year": 1,
    "< 1 year": 0,
    "n/a": 0
}
loads = loads.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loads["int_rate"] = loads["int_rate"].str.strip("%").astype(float)   # strip the % sign
loads["revol_util"] = loads["revol_util"].str.strip("%").astype(float)
loads["emp_length"] = loads["emp_length"].map(mapping_emp_length)

loads.head(10)
# map leaves unmatched values as NaN, so fill them by hand
loads["emp_length"] = loads["emp_length"].fillna(value=0)   # fill NA with 0
loads.head(10)
# One-hot encoding
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummies = pd.get_dummies(loads[cat_columns])
loads = pd.concat([loads, dummies], axis=1)
loads = loads.drop(cat_columns, axis=1)   # labels: single label or list-like
loads = loads.drop("pymnt_plan", axis=1)

loads.info()
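What `pd.get_dummies` does to a single categorical column, sketched on toy values:

```python
import pandas as pd

df = pd.DataFrame({"home_ownership": ["RENT", "OWN", "RENT"]})
dummies = pd.get_dummies(df["home_ownership"], prefix="home_ownership")
print(list(dummies.columns))                      # ['home_ownership_OWN', 'home_ownership_RENT']
print(int(dummies["home_ownership_RENT"].sum()))  # 2 -- one indicator column per category
```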

Model evaluation

Model evaluation: after the model runs, each sample row gets a predicted 1 or 0, and each sample also has a true 1 or 0. Evaluation means comparing the predicted labels against the true ones.

Note that regression and classification use different metrics. Regression metrics measure the distance between the fitted surface and the sample points and are relatively simple, e.g. MSE (mean squared error). Classification metrics judge how well a classifier separates the classes.

The confusion matrix is worth understanding. The way to read it: Positive/Negative refers to the predicted class, and True/False says whether that prediction matches the actual label.

Predicted  Actual  Outcome
1          1       True Positive  (predicted Positive, actually Positive)
1          0       False Positive (predicted Positive, actually Negative)
0          0       True Negative  (predicted Negative, actually Negative)
0          1       False Negative (predicted Negative, actually Positive)

How do we use these four counts? Through TPR (true positive rate) and FPR (false positive rate).

There are also Precision and Recall (the English terms are easier to keep straight).

Precision: of the samples the model predicted positive, the fraction that really are positive. Numerator: correctly predicted positives; denominator: correctly predicted positives + wrongly predicted positives. Higher is better. Precision = TP / (TP + FP).

Recall (TPR): of all real positive samples, the fraction the model found. Numerator: correctly predicted positives; denominator: all real positives, i.e. TP + FN (the FN being positives the model missed). Higher is better. Recall = TP / (TP + FN).

FPR = FP / (FP + TN): of all real negative samples, the fraction the model wrongly predicted as positive.

So we want a high TPR together with a low FPR.
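The four cells and the two rates above can also be computed directly with scikit-learn; a sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1]   # made-up actual labels
y_pred = [1, 0, 1, 0, 1, 1]   # made-up predictions

# For binary labels {0, 1}, confusion_matrix(...).ravel() yields (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # recall: found positives / all real positives
fpr = fp / (fp + tn)   # false alarms / all real negatives
print(tn, fp, fn, tp)  # 1 1 1 3
print(tpr, fpr)        # 0.75 0.5
```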


loans_result = loads["loan_status"]
loans_test = loads.drop("loan_status", axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold   # sklearn.cross_validation was removed in newer releases

lr = LogisticRegression()
kf = KFold(n_splits=3, shuffle=True, random_state=1)   # the old KFold(n, ...) signature no longer exists
predictions = cross_val_predict(lr, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions)   # wrap the ndarray as a Series so pandas operations work

actual = loads["loan_status"].reset_index(drop=True)   # align with the fresh 0..n-1 index of predictions

# TN
tn = (predictions == 0) & (actual == 0)
tn_num = len(predictions[tn])

# TP
tp = (predictions == 1) & (actual == 1)
tp_num = len(predictions[tp])

# FP
fp = (predictions == 1) & (actual == 0)
fp_num = len(predictions[fp])

# FN
fn = (predictions == 0) & (actual == 1)
fn_num = len(predictions[fn])

# rates
fpr = fp_num / float(fp_num + tn_num)
tpr = tp_num / float(tp_num + fn_num)

print(tpr)
print(fpr)
print(predictions[:20])

0.9988995955270045
0.9991050653302309

Both TPR and FPR are very high, which clearly does not satisfy us. Why does this happen?
Look at loans_result:
loans_result.value_counts()
# 1    33859
# 0     5639
The positives far outnumber the negatives, so the classes are imbalanced. Recall the model-tuning discussion:
if the positives outnumber the negatives only slightly, down-sample the positives; if they outnumber by a lot, collect more data or over-sample the negatives, and remember to sample randomly or with stratification.
Another option is to give the positive and negative classes different weights.
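The down-sampling remedy mentioned above can be sketched with pandas `sample` (toy data; the class sizes and seed are arbitrary):

```python
import pandas as pd

# Imbalanced toy data: 90 positives, 10 negatives.
df = pd.DataFrame({"loan_status": [1] * 90 + [0] * 10, "feat": range(100)})

pos = df[df["loan_status"] == 1]
neg = df[df["loan_status"] == 0]

# Randomly down-sample the majority class to the minority class size.
pos_down = pos.sample(n=len(neg), random_state=1)
balanced = pd.concat([pos_down, neg])
print(balanced["loan_status"].value_counts().to_dict())   # both classes now have 10 rows
```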


# Apply class weights
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold

lr = LogisticRegression(class_weight="balanced")   # class_weight is also a natural grid-search parameter
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions)
actual = loads["loan_status"].reset_index(drop=True)

# TN
tn = (predictions == 0) & (actual == 0)
tn_num = len(predictions[tn])

# TP
tp = (predictions == 1) & (actual == 1)
tp_num = len(predictions[tp])

# FP
fp = (predictions == 1) & (actual == 0)
fp_num = len(predictions[fp])

# FN
fn = (predictions == 0) & (actual == 1)
fn_num = len(predictions[fn])

# rates
fpr = fp_num / float(fp_num + tn_num)
tpr = tp_num / float(tp_num + fn_num)

print(tpr)
print(fpr)
print(predictions[:20])

0.6353794908398763
0.6207266869518525

# A dict of weights can be passed as well
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold

penalty = {
    0: 5,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions)

actual = loads["loan_status"].reset_index(drop=True)

# TN
tn = (predictions == 0) & (actual == 0)
tn_num = len(predictions[tn])

# TP
tp = (predictions == 1) & (actual == 1)
tp_num = len(predictions[tp])

# FP
fp = (predictions == 1) & (actual == 0)
fp_num = len(predictions[fp])

# FN
fn = (predictions == 0) & (actual == 1)
fn_num = len(predictions[fn])

# rates
fpr = fp_num / float(fp_num + tn_num)
tpr = tp_num / float(tp_num + fn_num)

print(tpr)
print(fpr)
print(predictions[:20])
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold

rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
# print(help(RandomForestClassifier))
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(rf, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions)

actual = loads["loan_status"].reset_index(drop=True)

# False positives.
fp_filter = (predictions == 1) & (actual == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (actual == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (actual == 1)
fn = len(predictions[fn_filter])

# True negatives.
tn_filter = (predictions == 0) & (actual == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
