<p><img alt="" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-28b8883a1ecb994d6a6dff22dc489833.png"></p>
<p> </p>
<p> CDA数据分析师 出品 </p>
<p>作者:真达、Mika</p>
<p>数据:真达 </p>
<p>【导读】</p>
<p>今天教大家如何用Python写一个电信用户流失预测模型。之前我们用Python写了员工流失预测模型,这次我们试试Python预测电信用户的流失。</p>
<p><strong>01、商业理解</strong></p>
<p>流失客户是指那些曾经使用过产品或服务,由于对产品失去兴趣等种种原因,不再使用产品或服务的顾客。</p>
<p>电信服务公司、互联网服务提供商、保险公司等经常使用客户流失分析和客户流失率作为他们的关键业务指标之一,因为留住一个老客户的成本远远低于获得一个新客户。</p>
<p>预测分析使用客户流失预测模型,通过评估客户流失的风险倾向来预测客户流失。由于这些模型生成了一个流失概率排序名单,对于潜在的高概率流失客户,他们可以有效地实施客户保留营销计划。</p>
<p>下面我们就教你如何用Python写一个电信用户流失预测模型,以下是具体步骤和关键代码。</p>
<p><strong>02、数据理解</strong></p>
<p>此次分析数据来自于IBM Sample Data Sets,统计自某电信公司一段时间内的消费数据。共有7043笔客户资料,每笔客户资料包含21个字段,其中1个客户ID字段,19个输入字段及1个目标字段-Churn(Yes代表流失,No代表未流失),输入字段主要包含以下三个维度指标:用户画像指标、消费产品指标、消费信息指标。字段的具体说明如下:</p>
<p><img alt="" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-e027f90b588605f64a7b980efb4bb4ae.png"></p>
<p><strong>03、数据读入和概览</strong></p>
<p>首先导入所需包。</p>
<pre class="blockcode"># 数据处理
import numpy as np
import pandas as pd
# 可视化
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
# 前处理
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# 建模
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
# 模型评估
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import roc_auc_score, roc_curve, scorer
from sklearn.metrics import recall_score, precision_score, f1_score, cohen_kappa_score
pd.set_option('display.max_columns', None)
</pre>
<p>读入数据集</p>
<pre class="blockcode">df = pd.read_csv('./Telco-Customer-Churn.csv')
df.head() </pre>
<p><img alt="" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-6e0f678920f490564176be4267d2997a.png"></p>
<p><strong>04、数据初步清洗</strong></p>
<p>首先进行初步的数据清洗工作,包含错误值和异常值处理,并划分类别型和数值型字段类型,其中清洗部分包含:</p>
<ul><li>OnlineSecurity、OnlineBackup、DeviceProtection、TechSupport、StreamingTV、StreamingMovies:错误值处理</li><li>TotalCharges:异常值处理</li><li>tenure:自定义分箱</li><li>定义类别型和数值型字段</li></ul>
<pre class="blockcode"># 错误值处理
repl_columns = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport','StreamingTV', 'StreamingMovies']
for i in repl_columns:
df[i] = df[i].replace({'No internet service' : 'No'})
# 替换值SeniorCitizen
df["SeniorCitizen"] = df["SeniorCitizen"].replace({1: "Yes", 0: "No"})
# 替换值TotalCharges
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan)
# TotalCharges空值:数据量小,直接删除
df = df.dropna(subset=['TotalCharges'])
df.reset_index(drop=True, inplace=True) # 重置索引
# 转换数据类型
df['TotalCharges'] = df['TotalCharges'].astype('float')
# 转换tenure
def transform_tenure(x):
if x <= 12:
return 'Tenure_1'
elif x <= 24:
return 'Tenure_2'
elif x <= 36:
return 'Tenure_3'
elif x <= 48:
return 'Tenure_4'
elif x <= 60:
return 'Tenure_5'
else:
return 'Tenure_over_5'
df['tenure_group'] = df.tenure.apply(transform_tenure)
# 数值型和类别型字段
Id_col = ['customerID']
target_col = ['Churn']
cat_cols = df.nunique()[df.nunique() < 10].index.tolist()
num_co |
|