Kaggle House Price Prediction: Feature Meanings and Predicting Prices in a US City

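This project comes from the Kaggle house price prediction competition.

Author: 陈浩 (Chen Hao)

1. Define the Problem (Business Understanding)

The question studied here is: what kind of house sells for a higher price?

2. Understand the Data (Data Understanding)

We split data understanding into 3 steps:

Collect the data: the data is already hosted on Kaggle, so we only need to download it;
Import the data: load the CSV files into a Python data structure (a pandas DataFrame);
Inspect the data to prepare for the next processing step.

2.1 Collect the data

Download the data from Kaggle.

2.2 Import the data

To make it easy to clean the training and test data together, we will merge the two datasets (the merge itself happens in section 2.3.3, after the target has been examined).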

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Data-handling packages
import numpy as np
import pandas as pd

# Plotting packages
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid',{'font.sans-serif':['simhei','Arial']})

from scipy import stats
from scipy.stats import norm, skew
# Load the data (raw strings keep the Windows backslashes intact)
# Training set
train = pd.read_csv(r'C:\Users\hp\house price pridict\train.csv')
# Test set
test = pd.read_csv(r'C:\Users\hp\house price pridict\test.csv')
print('Training set:',train.shape,'Test set:',test.shape)
# Remember the sizes of the training and test sets for later steps
train_shape = train.shape[0]
test_shape = test.shape[0]

Training set: (1460, 81) Test set: (1459, 80)

2.3 Inspect the dataset

2.3.1 Overview

# View the first 5 rows
train.head()

According to the information provided by Kaggle, the columns have the following meanings:

SalePrice - the sale price; this is the target we predict

MSSubClass: The building class

MSZoning: The general zoning classification

LotFrontage: Linear feet of street connected to property (the length of street frontage of the lot)

LotArea: Lot size in square feet

Street: Type of road access (the surface of the access road, e.g. gravel or paved)

Alley: Type of alley access (the surface of the alley, or NA when there is no alley access)

LotShape: General shape of property (how regular the outline of the lot is, not the floor plan of the house)

LandContour: Flatness of the property (whether the lot is level or sits on a bank, hillside or depression)

Utilities: Type of utilities available (which public utilities are supplied; some houses have all of them, others only some)

LotConfig: Lot configuration (how the lot sits relative to the streets, e.g. inside lot, corner lot or cul-de-sac)

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling ...

In practice we should use the distribution of the data to pick out the columns that correlate strongly with the target (SalePrice) and then study those in depth. Understanding the meaning of every single column is unnecessary and a waste of effort (plus I am a bit lazy), so not all columns are listed here.

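# View summary statistics
train.describe()

A broad look at everything at once does not reveal much, so we examine the statistics of the target on its own:

2.3.2 A quick look at the target

train['SalePrice'].describe()

We can see that:

The range of the target, 755000 - 34900 = 720100, is very large;
75% of prices fall below 214000, and the middle 50% of prices lie between 129965 and 214000;
The standard deviation is 79442.5, which is large relative to the mean (about 180921), so prices are widely spread.

From these 3 points we have reason to believe that extreme values are distorting the overall distribution, and we should watch for them and remove them later in the analysis.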
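As a rough cross-check of the "extreme values" suspicion, the quartiles quoted above can be plugged into Tukey's 1.5 x IQR rule. This rule is not used in the original analysis; it is shown here only as one common, assumed way to flag unusually high values from summary statistics alone.

```python
# Quartiles and maximum as quoted in the text above
q1, q3 = 129965.0, 214000.0
max_price = 755000.0

# Interquartile range and the conventional upper fence
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print("IQR:", iqr)
print("Upper fence:", upper_fence)
print("Maximum is beyond the fence:", max_price > upper_fence)
```

The maximum price lies far beyond the fence, which is consistent with the long right tail seen in the plots that follow.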

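We try to fit a distribution curve to the target. Seaborn's distplot() function can do this (it has been deprecated since seaborn 0.11, so newer versions emit a warning):

sns.distplot(train['SalePrice'],fit= norm)

'''
To get a better fit, we first fit a normal distribution to the target column, which gives the normal curve closest to the real data.
We then use a Q-Q (probability) plot to compare the theoretical normal distribution with the actual one and judge whether the real distribution can be treated as normal.
'''
# Parameters of the best-fitting normal distribution: mu is the mean, sigma the standard deviation
(mu,sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} \n sigma = {:.2f} \n'.format(mu,sigma))
# Set the legend and axis labels
plt.legend(['Normal dist. mu = {:.2f}  sigma = {:.2f}'.format(mu,sigma)],
           loc = 'best')
plt.ylabel('Frequency')
plt.xlabel('SalePrice')
plt.show()

mu = 180921.20 sigma = 79415.29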
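The fitted sigma (79415.29) is slightly smaller than the standard deviation reported by describe() (79442.5). That is expected: norm.fit returns the maximum-likelihood estimates, which divide by n, while pandas' describe() divides by n - 1. A small sketch on synthetic data (the lognormal sample below is made up for illustration and is not the Kaggle data):

```python
import numpy as np
from scipy.stats import norm

# Synthetic right-skewed "prices", illustrative only
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1460)

mu, sigma = norm.fit(prices)

# The maximum-likelihood fit is the sample mean and the ddof=0 standard deviation
print(np.isclose(mu, prices.mean()))
print(np.isclose(sigma, prices.std(ddof=0)))

# The ddof=1 version (what describe() shows) is slightly larger
sample_std = prices.std(ddof=1)
print(sample_std > sigma)
```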

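# Draw a Q-Q (probability) plot to compare the target distribution with the theoretical normal distribution
fig = plt.figure()
res = stats.probplot(train['SalePrice'],plot = plt)
plt.show()

Clearly, the target distribution differs considerably from a normal distribution. Many models, linear ones in particular, tend to predict more accurately when the target is close to normally distributed, so we need to transform the target toward a normal distribution.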
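Besides reading the plot by eye, stats.probplot also returns the correlation coefficient r of the straight-line fit through the plotted points, which gives a rough numeric sense of how close a sample is to normal. The sketch below uses a synthetic lognormal sample (an assumption, standing in for a right-skewed price column) and compares r before and after a log transform:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample, illustrative only
rng = np.random.default_rng(1)
skewed_sample = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# With the default fit=True, probplot returns ((osm, osr), (slope, intercept, r))
_, (_, _, r_raw) = stats.probplot(skewed_sample, dist="norm")
_, (_, _, r_log) = stats.probplot(np.log1p(skewed_sample), dist="norm")

print("r before log transform:", round(r_raw, 4))
print("r after log transform: ", round(r_log, 4))
```

An r closer to 1 means the points lie closer to the reference line, so the log-transformed sample should score higher here.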

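This is a bit like analyzing motion in high-school physics: non-uniform motion is hard to analyze directly, but once we differentiate the velocity data to obtain the acceleration, uniformly accelerated and decelerated motion becomes much easier to handle.

# Apply a log transform with numpy (log1p computes log(1 + x)) and look at the resulting distribution
train['SalePrice'] = np.log1p(train['SalePrice'])

# View the new distribution
sns.distplot(train['SalePrice'],fit = norm)
plt.show()

# View the Q-Q plot again
stats.probplot(train['SalePrice'], plot=plt)
plt.show()

The target now looks very close to a normal distribution.

We predict on this transformed target and then apply the inverse transform (np.expm1, the inverse of np.log1p) to the predictions to get prices back in the original units.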
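A quick round trip makes the choice of inverse concrete. np.log1p computes log(1 + x), so its exact inverse is np.expm1; using np.exp alone returns every value one unit too high. The three prices below are the minimum, upper quartile and maximum quoted earlier:

```python
import numpy as np

prices = np.array([34900.0, 214000.0, 755000.0])

logged = np.log1p(prices)        # log(1 + x)
restored = np.expm1(logged)      # exp(y) - 1, the exact inverse
off_by_one = np.exp(logged)      # exp(y) alone gives x + 1

print(restored)
print(off_by_one - prices)
```

For house prices the difference of 1 is negligible in practice, but using the matching inverse keeps the pipeline exact.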

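2.3.3 Missing values

# Merge the datasets to simplify data cleaning
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
full = pd.concat([train,test],ignore_index = True)
full.shape
(2919, 81)
# Flag missing values
full_null = full.isnull()
# Count missing values per column
full_null = full_null.sum()
# Keep only the columns that have missing values
full_null= full_null[full_null > 0]
full_null.sort_values(inplace = True)
full_null.shape

(35,)

# Visualize the missing-value counts
full_null.plot.bar(grid = False)
plt.show()

In total 35 columns contain missing values (one of them is SalePrice, which is absent for every test row). Looking at the data, in many columns an empty value actually means "not present", so we can simply fill those with "None".

At this point we can start the data cleaning work, beginning with filling the missing values.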
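Raw counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on a tiny made-up frame (the toy values are hypothetical and stand in for `full`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `full`; values are illustrative only
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values, keep affected columns, sort from most to least missing
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Put the count next to the fraction of rows that are missing
summary = pd.DataFrame({"count": missing, "ratio": missing / len(toy)})
print(summary)
```

A column that is missing in most rows (like PoolQC here) is usually a sign that the blank means "feature not present" rather than "value unknown".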

3. Data Cleaning (Data Preparation)

3.1 Data preprocessing

3.1.1 Handling missing values

First, we fill with 'None'!

'''
Alley: Type of alley access to property

       NA  No alley access
BsmtQual: Evaluates the height of the basement

       NA   No Basement

BsmtCond: Evaluates the general condition of the basement

       NA   No Basement

BsmtExposure: Refers to walkout or garden level walls

       NA   No Basement

BsmtFinType1: Rating of basement finished area

       NA   No Basement

BsmtFinType2: Rating of basement finished area (if multiple types)

       NA   No Basement

FireplaceQu: Fireplace quality

       NA   No Fireplace

GarageType: Garage location

       NA   No Garage

GarageFinish: Interior finish of the garage

       NA   No Garage

GarageQual: Garage quality

       NA   No Garage

GarageCond: Garage condition

       NA   No Garage

PoolQC: Pool quality

       NA   No Pool

Fence: Fence quality

       NA   No Fence

MiscFeature: Miscellaneous feature not covered in other categories   
       NA   None
       For these columns, fill NA with 'None'
'''
cols = ['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2'
        ,'FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PoolQC',
        'Fence','MiscFeature']
for i in cols:
    full[i] = full[i].fillna('None')

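Check the situation after filling:

full_null = full.isnull()
# Count missing values per column
full_null = full_null.sum()
# Keep only the columns that have missing values
full_null= full_null[full_null > 0]
full_null.sort_values(inplace = True)
full_null.shape
(21,)
full_null

'''
1. LotFrontage: Linear feet of street connected to property

This is the length of street frontage of the lot. Following an approach from top Kaggle kernels: houses in the same neighborhood tend to have similar LotFrontage, so we fill missing values with the median LotFrontage of the neighborhood;

'''
full['LotFrontage'] = full.groupby('Neighborhood')['LotFrontage'].transform(lambda x:x.fillna(x.median()))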

'''
2. GarageYrBlt: Year garage was built
   GarageArea: Size of garage in square feet
   GarageCars: Size of garage in car capacity

These are the garage's construction year, its area, and its capacity in cars. A missing value most likely means there is no garage, so we fill with 0;
'''
full['GarageYrBlt'] = full['GarageYrBlt'].fillna(0)
full['GarageArea'] = full['GarageArea'].fillna(0)
full['GarageCars'] = full['GarageCars'].fillna(0)
'''
3. MasVnrType: Masonry veneer type

The type of masonry veneer. Missing values can simply be replaced with None;

4. MasVnrArea: Masonry veneer area in square feet

The area of masonry veneer. Missing values are replaced with 0;
'''
full['MasVnrType'] = full['MasVnrType'].fillna('None')
full['MasVnrArea'] = full['MasVnrArea'].fillna(0)
'''
5. MSZoning: Identifies the general zoning classification of the sale.
The zoning classification of the property sold. We fill with the most frequent value;
'''
full['MSZoning'] = full['MSZoning'].fillna(full['MSZoning'].mode()[0])
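The neighborhood-median fill used for LotFrontage above is easier to follow on a tiny example. The frame below is made up for illustration; groupby(...).transform(...) returns a result aligned with the original rows, so it can be assigned straight back to the column:

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, np.nan],
})

# Within each neighborhood, replace NaN with that neighborhood's median
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda s: s.fillna(s.median()))
)
print(toy)
```

One caveat worth keeping in mind: if every value in a group were missing, that group's median would itself be NaN and the gaps would remain, so a follow-up fill with the overall median may be needed in that case.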

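The Series.mode() method returns the most frequent value(s) in a column, for example:

df = pd.DataFrame(np.arange(25).reshape(5,5))
# Use .loc for the assignment; chained indexing such as df[3][2:4] = 6 is unreliable
df.loc[2:3, 3] = 6
df

# The most frequent value in the column labelled 3 is 6
df[3].mode()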
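Two details of mode() explain the `.mode()[0]` pattern used in the fills here: it returns a Series (all tied values, sorted), and it ignores missing values by default. A minimal sketch with made-up zoning codes:

```python
import numpy as np
import pandas as pd

# A tie: "RL" and "RM" both appear twice, so both are returned, in sorted order
tied = pd.Series(["RL", "RM", "RL", "RM", "FV"])
modes = tied.mode()
print(modes.tolist())

# Missing values are ignored when finding the mode, so [0] is safe to use as a fill value
with_gaps = pd.Series([np.nan, "RL", np.nan, "RM", "RL"])
filled = with_gaps.fillna(with_gaps.mode()[0])
print(filled.tolist())
```

Taking element `[0]` therefore picks the first of the tied values, which is an arbitrary but deterministic choice.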

'''6. Utilities: Type of utilities available
The utilities supplied to the house. A missing value may mean that no utilities are available;
'''
full['Utilities'] = full['Utilities'].fillna('none')
'''
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
The number of full and half bathrooms in the basement. Missing values are probably due to there being no basement, so fill with 0
'''
full['BsmtFullBath'] = full['BsmtFullBath'].fillna(0)
full['BsmtHalfBath'] = full['BsmtHalfBath'].fillna(0)
'''
Functional: Home functionality
The functionality rating of the home. NA is treated as typical
'''
full['Functional'] = full['Functional'].fillna('Typ')
'''
Exterior1st: exterior covering on the house; fill with the most frequent value
'''
full['Exterior1st'] = full['Exterior1st'].fillna(full['Exterior1st'].mode()[0])
'''
BsmtUnfSF:Unfinished square feet of basement area
BsmtFinSF2:Type 2 finished square feet
BsmtFinSF1: Type 1 finished square feet 
TotalBsmtSF: Total square feet of basement area
These are basement areas of various kinds; fill directly with 0.
'''
full['BsmtUnfSF'] = full['BsmtUnfSF'].fillna(0)
full['BsmtFinSF2'] = full['BsmtFinSF2'].fillna(0)
full['BsmtFinSF1'] = full['BsmtFinSF1'].fillna(0)
full['TotalBsmtSF'] = full['TotalBsmtSF'].fillna(0)

'''
SaleType is described as follows:
SaleType: Type of sale
       WD   Warranty Deed - Conventional
       CWD  Warranty Deed - Cash
       VWD  Warranty Deed - VA Loan
       New  Home just constructed and sold
       COD  Court Officer Deed/Estate
       Con  Contract 15% Down payment regular terms
       ConLw  Contract Low Down payment and low interest
       ConL   Contract Low Interest
       ConLD  Contract Low Down
       Oth  Other

KitchenQual: Kitchen quality

       Ex   Excellent
       Gd   Good
       TA   Typical/Average
       Fa   Fair
       Po   Poor
Fill both directly with the most frequent value
'''
full['SaleType'].head()

full['SaleType'] = full['SaleType'].fillna(full['SaleType'].mode()[0])
full['KitchenQual'] = full['KitchenQual'].fillna(full['KitchenQual'].mode()[0])
full['Electrical'].head()

full['Electrical'] = full['Electrical'].fillna(full['Electrical'].mode()[0])
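# Exterior1st and Exterior2nd hold text labels, so fill with the most frequent label rather than the number 0
full['Exterior1st'] = full['Exterior1st'].fillna(full['Exterior1st'].mode()[0])
full['Exterior2nd'] = full['Exterior2nd'].fillna(full['Exterior2nd'].mode()[0])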
full_null = full.isnull().sum()
full_null.head()

full_null = full_null.drop(full_null[full_null == 0].index)
full_null

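Only the target is left unfilled (SalePrice is missing for the test rows by design), so the filling is complete!

3.1.2 Distribution analysis

We analyze the numeric data and the object-type data separately:

For numeric data we visualize the distribution with seaborn's distplot;
For object-type data we look at box plots.

# First split out the training rows (.loc slicing is label-based and inclusive)
train_anal = full.loc[:train_shape-1].copy()
train_anal.shape
(1460, 81)
# Then split out the test rows, starting at label train_shape so that no training row is included
test_anal = full.loc[train_shape:].copy()
test_anal.shape
(1459, 81)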
# Select the numeric columns
train_anal_numcol = list(train_anal.select_dtypes(include = ['number']).columns)
# Remove Id and SalePrice
train_anal_numcol.remove('Id')
train_anal_numcol.remove('SalePrice')
# Use melt() to reshape the wide table into a long (variable, value) table, then plot each distribution
train_anal_num_plot = pd.melt(train_anal,value_vars = train_anal_numcol)
train_anal_num_plot.head()

# seaborn's FacetGrid() draws one panel per variable, each with its distribution and fitted curve
plot = sns.FacetGrid(train_anal_num_plot,col = 'variable',col_wrap = 4,sharex = False,sharey = False)
show_plot = plot.map(sns.distplot,'value')
plt.show()
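The melt() step is what lets FacetGrid draw one panel per column: it stacks the selected columns into two columns, `variable` (the original column name) and `value`. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Wide table: one column per feature (values are illustrative)
wide = pd.DataFrame({"LotArea": [8450, 9600], "GrLivArea": [1710, 1262]})

# Long table: one row per (feature, value) pair
long = pd.melt(wide, value_vars=["LotArea", "GrLivArea"])
print(long)
```

FacetGrid then splits the long table on `variable` and plots the `value` column inside each panel.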

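From the distribution plots above we can draw two conclusions:

Many of the numeric columns are really categorical, so we consider converting them to str or another suitable type;
Some continuous variables have distributions that are close to normal but skewed; we can try a transformation to bring them closer to normal, which makes prediction easier for machine learning models.

# Convert to str
col_into_str = ['GarageYrBlt','MSSubClass','MoSold','YrSold','OverallCond','OverallQual']
for i in col_into_str:
    train_anal[i] = train_anal[i].astype(str)
    test_anal[i] = test_anal[i].astype(str)

Next we look at the distribution of the object columns, that is, the categorical data, by drawing box plots directly.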

# Select the object-type columns
train_anal_objcol = list(train_anal.select_dtypes(include = ['object']).columns)
# Convert object columns to the category dtype
for i in train_anal_objcol:
    train_anal[i] = train_anal[i].astype('category')

Draw the box plots:

# Use melt() to reshape the data into long form so that everything can be shown in one figure for comparison
train_anal_obj_plot = pd.melt(train_anal,id_vars = ['SalePrice'],
                              value_vars = train_anal_objcol)
# Draw the box plots (FacetGrid's `size` argument was renamed to `height` in seaborn 0.9)
train_anal_obj_plot_box = sns.FacetGrid(train_anal_obj_plot,col = 'variable',
                                      col_wrap = 2,
                                      sharex = False,sharey = False,
                                      height = 5
                                       )
train_anal_obj_plot_box
<seaborn.axisgrid.FacetGrid at 0x225fbc5dba8>
train_anal_obj_plot_box = train_anal_obj_plot_box.map(sns.boxplot,'value','SalePrice')
plt.show()

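From the box plots above we can conclude that:

OverallQual has a strong, roughly linear relationship with the target SalePrice: the higher the overall rating, the higher the price;
Exterior (exterior covering) also has a fairly large effect on price;
For quality-related columns, houses rated Ex (Excellent) sell for more;
MSSubClass has a fairly large effect on price;
Neighborhood has a very large effect on price.

Based on the analysis above, we choose the following columns to encode and use as features: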

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
cols = list(cols)
full[cols].head()

# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Loop over the selected columns and encode each one
for i in cols:
    le = LabelEncoder()
    le.fit(list(full[i].values))
    full[i] = le.transform(full[i].values)
full[cols].head()
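One property of LabelEncoder is worth knowing when it is applied to quality ratings: it numbers the classes in sorted (alphabetical) order, not in order of quality. The sketch below contrasts that with an explicit ordinal mapping. The `order` dictionary is an assumption written for this illustration, not something taken from the original analysis:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "None"])

# LabelEncoder sorts the class labels and numbers them 0..n-1
le = LabelEncoder()
alphabetical = le.fit_transform(quality)
print(list(le.classes_))
print(alphabetical.tolist())

# Hypothetical hand-written mapping that follows the quality scale instead
order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal = quality.map(order)
print(ordinal.tolist())
```

Tree-based models such as XGBoost can cope with arbitrary codes, but for linear models an encoding that follows the real order may be the more natural choice.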

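3.1.3 Correlation analysis

Draw a heatmap to get a quick view of how strongly the columns are correlated.

# numeric_only=True (pandas 1.5+) skips the categorical columns, which cannot be correlated directly
corr = train_anal.corr(numeric_only = True)
f = plt.subplots(figsize = (13,10))
sns.heatmap(corr,vmax = 0.8,square = True)
plt.show()

corr['SalePrice'].sort_values(ascending = False)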
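The ranking in the last line is what later drives feature selection, so here is the same idea on a tiny made-up frame: compute the correlation matrix over numeric columns only, rank everything against SalePrice, and keep what clears a threshold. All values below are hypothetical, and `numeric_only=True` assumes pandas 1.5 or newer:

```python
import pandas as pd

toy = pd.DataFrame({
    "SalePrice": [1.0, 2.0, 3.0, 4.0],
    "GrLivArea": [10, 20, 30, 45],
    "Noise": [4, 1, 3, 2],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],   # text column, skipped by corr
})

corr = toy.corr(numeric_only=True)

# Rank the other numeric columns by their correlation with the target
ranked = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
selected = ranked[ranked > 0.3].index.tolist()

print(ranked)
print(selected)
```

Note that a plain `> 0.3` filter keeps only positively correlated columns; strongly negative ones would need an absolute-value filter.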

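3.1.4 Outlier handling

We use scatter plots of the most strongly correlated columns against the target to find outliers:

sns.set()
col_scatterplot = ['SalePrice', 'OverallQual', 'GarageArea','GrLivArea', 'GarageCars', 'TotalBsmtSF',
                   'FullBath','1stFlrSF']
scatter_train = full.loc[:train_shape-1][:]
train_anal.shape
(1460, 81)
# pairplot's `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(scatter_train[col_scatterplot],height = 3)
plt.show()

It is clear that:

GrLivArea has two points that stand out;
TotalBsmtSF and 1stFlrSF each have one point that stands out.

We need to be very careful when deleting outliers, because we may well be deleting valid values, so the data should be inspected closely before deciding. In this analysis GrLivArea, TotalBsmtSF and 1stFlrSF all measure some area of the house, and in general a house does not get cheaper as it gets bigger, so we delete these points. Note that SalePrice is already on the log scale here, so the threshold 12.5 used below corresponds to roughly 268,000 in original units.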

full = full.drop(full[(full['GrLivArea']>4000) & (full['SalePrice']<12.5)].index)
full.shape

(2917, 81)

full = full.drop(full[(full['TotalBsmtSF']>4000) & (full['SalePrice']<12.5)].index)
full = full.drop(full[(full['1stFlrSF']>4000) & (full['SalePrice']<12.5)].index)
full.shape

(2917, 81)

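Note that we have deleted 2 records that came from the original train data, so when splitting the data later we must subtract the corresponding number of records.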
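The filter only ever removes training rows, even though it runs on the merged frame. Test rows have no SalePrice, and any comparison with NaN evaluates to False, so they can never satisfy the `SalePrice < 12.5` condition. A small sketch with made-up rows (the last row plays the role of a test record):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GrLivArea": [1500, 4500, 5000, 4600],
    "SalePrice": [12.0, 12.1, 13.5, np.nan],   # log scale; NaN marks a test row
})

# Large area AND low (log) price; NaN < 12.5 is False, so the test row is kept
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 12.5)
cleaned = toy.drop(toy[mask].index)

print(mask.tolist())
print(cleaned.index.tolist())
```

Because drop() keeps the original index labels, later label-based slices such as `.loc[0:train_shape-1]` still separate training rows from test rows correctly.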

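3.1.5 Handling skewed variables

Following the structure of well-regarded Kaggle kernels, we analyze the skewed variables.

In the distribution analysis above we saw that many numeric columns have a shape resembling a normal distribution but skewed clearly to the left or right. Many machine learning models work best with normally distributed variables, so transforming these variables toward a normal distribution should make the predictions more accurate.

scipy's boxcox1p() function can do this transformation for us.

First, we compute the skewness of each variable:

# Select the numeric columns
full_num_col = full.dtypes[full.dtypes != 'object'].index
full_num_col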
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
       'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF',
       'CentralAir', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Fence',
       'FireplaceQu', 'Fireplaces', 'FullBath', 'Functional', 'GarageArea',
       'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageYrBlt',
       'GrLivArea', 'HalfBath', 'HeatingQC', 'Id', 'KitchenAbvGr',
       'KitchenQual', 'LandSlope', 'LotArea', 'LotFrontage', 'LotShape',
       'LowQualFinSF', 'MSSubClass', 'MasVnrArea', 'MiscVal', 'MoSold',
       'OpenPorchSF', 'OverallCond', 'OverallQual', 'PavedDrive', 'PoolArea',
       'PoolQC', 'SalePrice', 'ScreenPorch', 'Street', 'TotRmsAbvGrd',
       'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold'],
      dtype='object')
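# Compute the skewness of each column
skewed = full[full_num_col].apply(lambda x:skew(x.dropna())).sort_values(ascending = False)
skewed = pd.DataFrame(skewed)

skewed

# Remove the target SalePrice and the Id column
skewed = skewed.drop(['SalePrice'])
skewed = skewed.drop(['Id'])
  
# Keep only the columns whose absolute skewness exceeds 0.75
# (filter on the single column 0; masking the whole DataFrame would keep every row and only blank out values)
skewed = skewed[skewed[0].abs() > 0.75]

# Import the Box-Cox function
from scipy.special import boxcox1p

boxcox1p requires a lam parameter. I have not yet learned how to solve for it, so I simply follow other Kaggle participants and use 0.15.

skewed_col = list(skewed.index)
lam = 0.15
for i in skewed_col:
    full[i] = boxcox1p(full[i],lam)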
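For reference, boxcox1p(x, lam) computes ((1 + x)^lam - 1) / lam for a nonzero lam, and reduces to log1p(x) when lam is 0. The sketch below checks this on synthetic right-skewed data (made up for illustration) and shows that lam = 0.15 reduces the skewness; scipy.special.inv_boxcox1p undoes the transform:

```python
import numpy as np
from scipy.special import boxcox1p, inv_boxcox1p
from scipy.stats import skew

# Synthetic right-skewed feature, illustrative only
rng = np.random.default_rng(2)
x = rng.lognormal(mean=7.0, sigma=0.8, size=2000)

lam = 0.15
y = boxcox1p(x, lam)

# Same transform written out by hand
manual = ((1.0 + x) ** lam - 1.0) / lam

print("skew before:", round(float(skew(x)), 3))
print("skew after: ", round(float(skew(y)), 3))
print("matches formula:", np.allclose(y, manual))
print("lam = 0 is log1p:", np.allclose(boxcox1p(x, 0.0), np.log1p(x)))
print("round trip:", np.allclose(inv_boxcox1p(y, lam), x))
```

As for choosing lam, scipy.stats.boxcox_normmax can estimate a value per column from the data; whether that beats a fixed 0.15 for this dataset is not something the original analysis tests.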

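3.3 Feature selection

We select a number of numeric and non-numeric features that are strongly correlated with the target:

# Select numeric features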
col_num = corr['SalePrice'].sort_values(ascending = False)
col_num_highcorr = col_num[col_num > 0.3]
col_num_highcorr = list(col_num_highcorr.index)
col_num_highcorr.remove('SalePrice')
col_num_highcorr = full[col_num_highcorr]
col_num_highcorr.head()

col_num_highcorr.shape
(2917, 17)
# Select the (label-encoded) categorical features
cat_cols = full[cols]
cat_cols.shape

(2917, 26)

full_X = pd.concat([col_num_highcorr,cat_cols],axis = 1)
full_X.head()

full_X.shape
(2917, 43)

One-hot encode all remaining categorical variables

full_X = pd.get_dummies(full_X)
full_X.shape
(2917, 43)
full_X.head()

full = pd.get_dummies(full)
full.shape

(2917, 226)

full.head()
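The two shapes above behave differently for a simple reason: pd.get_dummies only expands text (object, string or category) columns and leaves numeric columns untouched. full_X contains only numeric and label-encoded columns, so it stays at 43 columns, while full still has text columns and grows. A small sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "KitchenQual": [2, 3, 2],                       # already label-encoded integers
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
})

# Only numeric columns: nothing to expand, shape is unchanged
numeric_only = pd.get_dummies(toy[["GrLivArea", "KitchenQual"]])

# With a text column: one indicator column per distinct label
with_text = pd.get_dummies(toy)

print(numeric_only.shape)
print(list(with_text.columns))
```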

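4. Build the Model (Modeling)

4.1 Build the training and test datasets

First, split out the training data and the test data:

# Two training records were deleted earlier, so the training block is the first train_shape-2 rows
train = full[0:train_shape-2][:]
test = full[train_shape-2:][:]
train_shape

1460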

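'''
We prefix the original training set with "source" to tell it apart, and later split it into training and test data.
The original data had 1460 records (1458 remain after removing the outliers)
'''
# Features of the original dataset
source_X = full_X.loc[0:train_shape-1,:]
# Labels of the original dataset
source_y = full.loc[0:train_shape-1,'SalePrice']

# Features of the prediction dataset
pred_X = full_X.loc[train_shape:,:]
# Check that the split is correct
print('source_X shape:',source_X.shape[0])
print('pred_X shape:',pred_X.shape[0])

source_X shape: 1458
pred_X shape: 1459

The result is as expected.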

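'''
Split the original dataset (source) into a training set (train, used to fit the model) and a test set (test, used to evaluate it).
train_test_split is a commonly used helper that randomly splits the samples into train data and test data by a given proportion.
train_data: the feature set to split
train_target: the labels to split
test_size: the proportion of samples in the test set; if an integer, the number of samples
'''
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

train_X,test_X,train_y,test_y = train_test_split(source_X,source_y,train_size = 0.8)


# Print the dataset sizes
print ('Source features:',source_X.shape, 
       '\nTraining features:',train_X.shape ,
      '\nTest features:',test_X.shape)

print ('\nSource labels:',source_y.shape, 
       '\nTraining labels:',train_y.shape ,
      '\nTest labels:',test_y.shape)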

4.2 Choose the algorithms

I plan to try 3 different algorithms and compare their predictive performance:

1. Linear regression model:

from sklearn.linear_model import LinearRegression
model_linear = LinearRegression()

2. LassoCV model:

from sklearn.linear_model import LassoCV
model_LassoCV = LassoCV(alphas = [1,0.1,0.001,0.005])

3. XGBoost model:

import xgboost as xgb 

# Recent XGBoost versions use verbosity and n_jobs in place of the older silent and nthread arguments
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0,
                             random_state =7, n_jobs = -1)

4.3 Train the models

#1. Linear regression model:
model_linear.fit(train_X,train_y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
#2. LassoCV model:
model_LassoCV.fit(train_X,train_y)
LassoCV(alphas=[1, 0.1, 0.001, 0.005], copy_X=True, cv=None, eps=0.001,
    fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1,
    normalize=False, positive=False, precompute='auto', random_state=None,
    selection='cyclic', tol=0.0001, verbose=False)
#3. XGBoost model:
model_xgb.fit(train_X,train_y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05,
       max_delta_step=0, max_depth=3, min_child_weight=1.7817,
       missing=None, n_estimators=2200, n_jobs=1, nthread=-1,
       objective='reg:linear', random_state=7, reg_alpha=0.464,
       reg_lambda=0.8571, scale_pos_weight=1, seed=None, silent=1,
       subsample=0.5213)

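5. Evaluate the Models

5.1 Define the evaluation criterion

For a more reliable evaluation we do not use the simple score() method. Instead, following the Kaggle guidance, we define a scoring function:

# Import KFold and cross_val_score
from sklearn.model_selection import  KFold,cross_val_score

KFold is one of the most commonly used cross-validation methods.

The idea is to split the original data into K groups. Each subset serves once as the validation set while the remaining K-1 subsets form the training set, which yields K models. The average validation score of these K models (here the RMSE, since this is a regression problem) is the performance measure under K-fold cross-validation.

In this example the train data is split into n_folds = 5 parts; each subset is used once as the test set while the other 4 are used for training, repeated 5 times, which reduces the random error of a single split.

The score is the square root of the mean squared error returned by cross_val_score, i.e. the RMSE. Because the target is log-transformed, this is effectively the root mean squared log error.

# Define the evaluation function
n_folds = 5

def resle_cv(model):
    # Pass the KFold object itself as cv; get_n_splits() would return only the integer 5 and the shuffle setting would be lost
    kf = KFold(n_folds,shuffle = True,random_state = 42)
    rmse = np.sqrt(-cross_val_score(model,source_X,source_y,
                                    scoring='neg_mean_squared_error',
                                   cv = kf))
    return (rmse)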
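One detail matters when building the cv argument: KFold.get_n_splits() returns only the number of folds as a plain integer, not a splitter. Passing an integer to cross_val_score makes scikit-learn build its own unshuffled KFold for a regressor, so shuffle and random_state only take effect if the KFold object itself is passed. A minimal sketch on a made-up array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 illustrative samples

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits returns just the fold count
n = kf.get_n_splits(X)
print(type(n).__name__, n)

# An unshuffled 5-fold split (what cv=5 means for a regressor) takes rows in order
first_unshuffled_test = next(KFold(n_splits=5).split(X))[1]
print(first_unshuffled_test.tolist())

# The shuffled splitter draws its test indices from a random permutation instead
first_shuffled_test = next(kf.split(X))[1]
print(first_shuffled_test.tolist())
```

When the rows of a dataset are in no particular order the difference is usually small, but passing the splitter object keeps the evaluation consistent with what the code says it does.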

5.2 Model evaluation

# Evaluate model_linear
score_model_linear = resle_cv(model_linear)
score_model_linear = [score_model_linear.mean(),score_model_linear.std()]
score_model_linear

[0.13693449571925723, 0.0090525601888024141]

# Evaluate model_LassoCV
score_model_LassoCV = resle_cv(model_LassoCV)
score_model_LassoCV  = [score_model_LassoCV.mean(),score_model_LassoCV.std()]
score_model_LassoCV

[0.14129442231358591, 0.0089227750375478563]

# Evaluate model_xgb
score_model_xgb = resle_cv(model_xgb)
score_model_xgb  = [score_model_xgb.mean(),score_model_xgb.std()]
score_model_xgb

[0.12914188325706497, 0.0066365151551340948]

Among the three, XGBoost has the lowest cross-validated RMSE (about 0.129). Even so, we choose the LassoCV model as the final prediction model here.

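6. Deployment

6.1 Save the predictions and upload them to Kaggle

# Predict with the fitted model
pred_Y = model_LassoCV.predict(pred_X)
#Id
Id = full.loc[train_shape:,'Id']

Remember that the target was log-transformed, so the predictions need the inverse transform:

# np.expm1 is the exact inverse of the np.log1p applied to the target
pred_Y = np.expm1(pred_Y)
# Build the result DataFrame
predDf = pd.DataFrame(
            {'Id':Id,
            'SalePrice':pred_Y}
)
predDf.shape
predDf.head()

# Save the result
predDf.to_csv('houseprise_predict_4',index = False)

Finally, we upload the result to Kaggle. After scoring, it ranked 2313 out of 4630, roughly the top 50%.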

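Postscript:

This project has many features (81 columns) and a fairly small amount of data overall, yet analyzing the data and extracting useful features is not easy.

Amusingly, when I discussed the project with someone halfway through, they said it was a project everyone had already done to death. At the time it really felt "tasteless to eat, yet a pity to throw away", but I kept going and finished it.

I originally planned to complete the project entirely on my own, but the score of my first submission was simply too low, so I made many changes with reference to well-regarded kernels, and along the way gradually came to understand skewness, the common machine learning packages, and more. Although I know more than when I started, I feel my shortcomings all the more, and I need to keep building my foundations through projects!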