Data preprocessing, missing values, errors, and outliers
missing values
The cells below summarize the missing values. While examining these features, I found erroneous values in both the train and test sets, so before imputing the remaining missing values I correct the errors by inference.
Three groups of features contain erroneous values. We keep these features in a list, run the usual fillna first, and then fill the listed features with the median as a final refinement step.
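The two-pass strategy above can be sketched as follows. This is a minimal, hypothetical illustration: the column choices and the single-member `error_cols` list are assumptions, not the exact lists used in the notebook.

```python
import pandas as pd

# Toy stand-in for the combined data set.
df = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd"],
    "GarageArea": [548.0, None, 360.0],
    "LotFrontage": [65.0, None, 68.0],
})

# Pass 1: the usual fillna. A categorical garage NaN means "no garage",
# so fill with the string "None"; a numeric garage NaN means zero area.
df["GarageType"] = df["GarageType"].fillna("None")
df["GarageArea"] = df["GarageArea"].fillna(0)

# Pass 2 (refinement): features kept in the error list are filled
# with the median instead.
error_cols = ["LotFrontage"]  # assumed example member of the list
for c in error_cols:
    df[c] = df[c].fillna(df[c].median())

print(df)
```

The point of keeping the error-prone features in a list is that they are excluded from the blanket "None"/0 fill and get a median fill in a second pass.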
garage features
For the garage features, two rows in the test set have a NaN GarageFinish but a non-null GarageType. For Id 2127, the NaNs should be replaced by the median; for Id 2577, GarageType itself should be replaced by NaN, since the house evidently has no garage. Apart from these two erroneous rows, garage NaNs are replaced by None.
     GarageType  GarageYrBlt GarageFinish  GarageCars  GarageArea GarageQual  GarageCond
666      Detchd          NaN          NaN         1.0       360.0        NaN         NaN
1116     Detchd          NaN          NaN         NaN         NaN        NaN         NaN

666     2127
1116    2577
Name: Id, dtype: int64
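The two corrections can be written out as below. The median year and the typical finish used for Id 2127 are placeholder assumptions here; in the notebook they would come from the full train+test data.

```python
import numpy as np
import pandas as pd

# Reconstruct the two problematic test-set rows shown above.
test = pd.DataFrame(
    {"Id": [2127, 2577],
     "GarageType": ["Detchd", "Detchd"],
     "GarageYrBlt": [np.nan, np.nan],
     "GarageFinish": [np.nan, np.nan],
     "GarageCars": [1.0, np.nan],
     "GarageArea": [360.0, np.nan]}
).set_index("Id")

# Id 2127 really has a garage (it has an area and a car count),
# so fill its NaNs with typical values.
test.loc[2127, "GarageYrBlt"] = 1979.0   # assumed median GarageYrBlt
test.loc[2127, "GarageFinish"] = "Unf"   # assumed most common finish

# Id 2577 has no garage at all: its non-null GarageType is the error,
# so set it back to NaN and let the usual "None" fill handle it.
test.loc[2577, "GarageType"] = np.nan
print(test)
```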
The last step of data preprocessing is to transform the categorical and ordinal features, converting their string values into numerical encodings.
# Transform some numerical variables that are really categorical
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
# OverallCond is also treated as categorical
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
# Year and month sold are transformed into categorical features
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
print(all_data.head())

# Label-encode some categorical variables that may carry information in their ordering
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure',
        'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street',
        'Alley', 'CentralAir', 'OverallCond', 'YrSold')
# Apply LabelEncoder to each categorical feature
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
print('Shape all_data: {}'.format(all_data.shape))
all_data.head()
Shape all_data: (2914, 78)
  MSSubClass MSZoning  LotFrontage  LotArea  Street  Alley  LotShape LandContour LotConfig  LandSlope  ...  ScreenPorch  PoolArea  PoolQC  Fence MiscFeature  MiscVal  MoSold  YrSold SaleType SaleCondition
0         60       RL         65.0     8450       1      1         3         Lvl    Inside          0  ...            0         0       3      4        None        0       2       2       WD        Normal
1         20       RL         80.0     9600       1      1         3         Lvl       FR2          0  ...            0         0       3      4        None        0       5       1       WD        Normal
2         60       RL         68.0    11250       1      1         0         Lvl    Inside          0  ...            0         0       3      4        None        0       9       2       WD        Normal
3         70       RL         60.0     9550       1      1         0         Lvl    Corner          0  ...            0         0       3      4        None        0       2       0       WD       Abnorml
4         60       RL         84.0    14260       1      1         0         Lvl       FR2          0  ...            0         0       3      4        None        0      12       2       WD        Normal

5 rows × 78 columns
from scipy.stats import skew

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
print(numeric_feats)
# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
0.1268197712423099
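The `model_lasso` used below is not defined in this excerpt. A plausible reconstruction, under the assumption that it is a cross-validated lasso fit on the processed features, is sketched here with small synthetic stand-ins for `Xtrain` and the target:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic stand-ins for the processed training data (assumption:
# the real Xtrain is the encoded feature matrix from above).
rng = np.random.RandomState(0)
Xtrain = pd.DataFrame(rng.randn(100, 5),
                      columns=[f"f{i}" for i in range(5)])
y = Xtrain["f0"] * 2.0 + rng.randn(100) * 0.1   # only f0 is informative

# Cross-validated lasso: picks the regularization strength by CV,
# zeroing out uninformative coefficients along the way.
model_lasso = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(Xtrain, y)
coef = pd.Series(model_lasso.coef_, index=Xtrain.columns)
print(coef)
```

On the real data this is the step that would produce a coefficient vector with many exact zeros, which the cell below then counts and plots.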
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

coef = pd.Series(model_lasso.coef_, index=Xtrain.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
imp_coef = pd.concat([coef.sort_values().head(20),
                      coef.sort_values().tail(20)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Coefficients in the Lasso Model")
Lasso picked 88 variables and eliminated the other 134 variables