Data preprocessing, missing values, errors, and outliers
missing values
The cells below summarize the missing values. While examining these features, I found erroneous values in both the train and test sets, so before imputing the remaining missing values I correct the errors by inference.
Three groups of features contain erroneous values. We keep these features in a list, run the usual fillna first, and then fill the listed features with the median as a final refinement step.
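The two-pass strategy above can be sketched as follows. This is a minimal, hypothetical illustration: the column choices and the single-member `error_cols` list are assumptions, not the exact lists used in the notebook.

```python
import pandas as pd

# Toy stand-in for the combined data set.
df = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd"],
    "GarageArea": [548.0, None, 360.0],
    "LotFrontage": [65.0, None, 68.0],
})

# Pass 1: the usual fillna. A categorical garage NaN means "no garage",
# so fill with the string "None"; a numeric garage NaN means zero area.
df["GarageType"] = df["GarageType"].fillna("None")
df["GarageArea"] = df["GarageArea"].fillna(0)

# Pass 2 (refinement): features kept in the error list are filled
# with the median instead.
error_cols = ["LotFrontage"]  # assumed example member of the list
for c in error_cols:
    df[c] = df[c].fillna(df[c].median())

print(df)
```

The point of keeping the error-prone features in a list is that they are excluded from the blanket "None"/0 fill and get a median fill in a second pass.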
garage features
For the garage features, two rows in the test set have a NaN GarageFinish but a non-null GarageType. For Id 2127, the NaNs should be replaced by the median; for Id 2577, GarageType itself should be replaced by NaN, since the house evidently has no garage. Apart from these two erroneous rows, garage NaNs are replaced by None.
     GarageType  GarageYrBlt GarageFinish  GarageCars  GarageArea GarageQual  GarageCond
666      Detchd          NaN          NaN         1.0       360.0        NaN         NaN
1116     Detchd          NaN          NaN         NaN         NaN        NaN         NaN

666     2127
1116    2577
Name: Id, dtype: int64
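The two corrections can be written out as below. The median year and the typical finish used for Id 2127 are placeholder assumptions here; in the notebook they would come from the full train+test data.

```python
import numpy as np
import pandas as pd

# Reconstruct the two problematic test-set rows shown above.
test = pd.DataFrame(
    {"Id": [2127, 2577],
     "GarageType": ["Detchd", "Detchd"],
     "GarageYrBlt": [np.nan, np.nan],
     "GarageFinish": [np.nan, np.nan],
     "GarageCars": [1.0, np.nan],
     "GarageArea": [360.0, np.nan]}
).set_index("Id")

# Id 2127 really has a garage (it has an area and a car count),
# so fill its NaNs with typical values.
test.loc[2127, "GarageYrBlt"] = 1979.0   # assumed median GarageYrBlt
test.loc[2127, "GarageFinish"] = "Unf"   # assumed most common finish

# Id 2577 has no garage at all: its non-null GarageType is the error,
# so set it back to NaN and let the usual "None" fill handle it.
test.loc[2577, "GarageType"] = np.nan
print(test)
```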
The last step of data preprocessing is to transform the categorical and ordinal features, converting their string values into numerical encodings.
# Transform some numerical variables that are really categorical
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
# OverallCond is also treated as categorical
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
# Year and month sold are transformed into categorical features
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
print(all_data.head())

# Label-encode some categorical variables that may carry information in their ordering
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure',
        'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street',
        'Alley', 'CentralAir', 'OverallCond', 'YrSold')
# Apply LabelEncoder to each categorical feature
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
print('Shape all_data: {}'.format(all_data.shape))
all_data.head()
Shape all_data: (2914, 78)
  MSSubClass MSZoning  LotFrontage  LotArea  Street  Alley  LotShape LandContour LotConfig  LandSlope  ...  ScreenPorch  PoolArea  PoolQC  Fence MiscFeature  MiscVal  MoSold  YrSold SaleType SaleCondition
0         60       RL         65.0     8450       1      1         3         Lvl    Inside          0  ...            0         0       3      4        None        0       2       2       WD        Normal
1         20       RL         80.0     9600       1      1         3         Lvl       FR2          0  ...            0         0       3      4        None        0       5       1       WD        Normal
2         60       RL         68.0    11250       1      1         0         Lvl    Inside          0  ...            0         0       3      4        None        0       9       2       WD        Normal
3         70       RL         60.0     9550       1      1         0         Lvl    Corner          0  ...            0         0       3      4        None        0       2       0       WD       Abnorml
4         60       RL         84.0    14260       1      1         0         Lvl       FR2          0  ...            0         0       3      4        None        0      12       2       WD        Normal

5 rows × 78 columns
from scipy.stats import skew

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
print(numeric_feats)
# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
0.1268197712423099
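The `model_lasso` used below is not defined in this excerpt. A plausible reconstruction, under the assumption that it is a cross-validated lasso fit on the processed features, is sketched here with small synthetic stand-ins for `Xtrain` and the target:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic stand-ins for the processed training data (assumption:
# the real Xtrain is the encoded feature matrix from above).
rng = np.random.RandomState(0)
Xtrain = pd.DataFrame(rng.randn(100, 5),
                      columns=[f"f{i}" for i in range(5)])
y = Xtrain["f0"] * 2.0 + rng.randn(100) * 0.1   # only f0 is informative

# Cross-validated lasso: picks the regularization strength by CV,
# zeroing out uninformative coefficients along the way.
model_lasso = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(Xtrain, y)
coef = pd.Series(model_lasso.coef_, index=Xtrain.columns)
print(coef)
```

On the real data this is the step that would produce a coefficient vector with many exact zeros, which the cell below then counts and plots.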
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

coef = pd.Series(model_lasso.coef_, index=Xtrain.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
imp_coef = pd.concat([coef.sort_values().head(20),
                      coef.sort_values().tail(20)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Coefficients in the Lasso Model")
Lasso picked 88 variables and eliminated the other 134 variables