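Data collection and labeling
Web crawlers
Summary of techniques
Summary of domain knowledge by industry
Data cleaning
Object (string) values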
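# List the distinct values in a column ('column_name' is a placeholder)
tab_1['column_name'].unique()
# Map a value to 1 if it contains the target substring, otherwise 2
def function(a):
    if 'value or substring' in a:
        return 1
    else:
        return 2
tab_1['result'] = tab_1.apply(lambda x: function(x['result']), axis=1)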
import re
# Replace every non-letter character with a space
def re_1(i):
    res = re.sub("[^a-zA-Z]", " ", i)
    return res
test_1['new_review'] = test_1.apply(lambda x: re_1(x['review']), axis=1)
Continuous values
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# Scalers expect 2-D input, so select the column as a one-column DataFrame
x = data[["Alcohol"]]
std = StandardScaler()
x_std = std.fit_transform(x)
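The notes import MinMaxScaler but never use it. As a minimal sketch of how it might be applied to the same kind of column, the block below rescales values into the range [0, 1]. The small `data` frame is a made-up stand-in for the real table.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the table that holds the "Alcohol" column.
data = pd.DataFrame({"Alcohol": [12.8, 13.2, 14.1, 11.9]})

# Double brackets keep the selection 2-D, which scalers require.
x = data[["Alcohol"]]

mm = MinMaxScaler()          # default feature_range is (0, 1)
x_mm = mm.fit_transform(x)   # NumPy array with one column

print(x_mm.ravel())
```

StandardScaler centers to zero mean and unit variance, while MinMaxScaler maps the observed minimum and maximum to the ends of the range. Which one suits a model better depends on the data.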
Discrete values
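The notes give no snippet for discrete values. One common treatment is one-hot encoding, and the name all_dummy_df used in the missing-value snippet suggests that is what was done. The sketch below assumes pd.get_dummies; the `color` and `price` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical table with one categorical and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "red"],
    "price": [10, 20, 15],
})

# Expand "color" into one indicator column per category.
all_dummy_df = pd.get_dummies(df, columns=["color"])

print(all_dummy_df.columns.tolist())
```

Columns that are not listed in `columns` pass through unchanged, so numeric features stay as they are.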
Missing values
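# Total number of missing values in the whole table
all_dummy_df.isnull().sum().sum()
# Fill missing values with each column's mean
mean_cols = all_dummy_df.mean()
all_dummy_df = all_dummy_df.fillna(mean_cols)
# Or fill with a constant; 0 here stands in for whatever number you choose
all_dummy_df = all_dummy_df.fillna(0)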
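Outliers
# Cap prices above 13000 (.ix was removed from pandas, so use .loc)
train_test.loc[train_test['price'] > 13000, 'price'] = 13000
# Replace an implausible bathroom count with 1.5
train_test.loc[train_test["bathrooms"] == 112, "bathrooms"] = 1.5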
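Time series
import numpy as np
import pandas as pd
# Monthly periods from January 2017 through February 2019
rng = pd.period_range('1/1/2017', '2/28/2019', freq='M')
data_1 = pd.Series(np.random.randn(len(rng)), index=rng)
# '企业编号' is the company ID column
df = pd.DataFrame({"data": data_1, "企业编号": 4001})
df.drop('data', inplace=True, axis=1)
df.head()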
Feature selection
Select the features whose contribution exceeds 95%
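from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import KFold, cross_val_score
# Keep the 2 highest-scoring features
selector = SelectKBest(k=2)
X_new = selector.fit_transform(X, Y)
# Score the model on the selected features with 10-fold cross-validation
kfold = KFold(n_splits=10)
cv_result = cross_val_score(model, X_new, Y, cv=kfold)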
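The heading asks for a 95% contribution threshold, but the SelectKBest snippet keeps a fixed number of features. One possible reading of the threshold is cumulative explained variance, which PCA supports directly. The sketch below makes that assumption and runs on synthetic data. Note that PCA builds new components rather than picking original columns, so it is a swapped-in technique and not the same thing as feature selection.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 5 features driven by
# 2 underlying signals plus a little noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])
X = X + 0.01 * rng.normal(size=X.shape)

# A float in (0, 1) tells PCA to keep just enough components to explain
# at least that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

cumulative = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, cumulative)
```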
Analyze relationships between features
import matplotlib.pyplot as plt
import seaborn as sns
contFeatureslist = ["bathrooms", "bedrooms", "price"]
# Absolute pairwise correlations between the continuous features
correlationMatrix = train[contFeatureslist].corr().abs()
plt.subplots(figsize=(13, 9))
sns.heatmap(correlationMatrix, annot=True)
# Second pass: mask every cell whose correlation is below 1
sns.heatmap(correlationMatrix, mask=correlationMatrix < 1, cbar=False)
plt.show()
Model selection
How to choose a model
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
models = []
models.append(("KNN", KNeighborsClassifier(n_neighbors=2)))
models.append(("KNN with weights", KNeighborsClassifier(
    n_neighbors=2, weights="distance")))
# RadiusNeighborsClassifier takes radius, not n_neighbors
models.append(("Radius Neighbors", RadiusNeighborsClassifier(radius=500.0)))
# Fit each candidate and record its score on the test set
results = []
for name, model in models:
    model.fit(X_train, Y_train)
    results.append((name, model.score(X_test, Y_test)))
for name, score in results:
    print("name: {}; score: {}".format(name, score))
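k-nearest neighbors algorithm
Linear regression algorithm
Logistic regression algorithm
Decision tree
Support vector machine
Naive Bayes
PCA algorithm
k-means algorithm
xgboost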
Model training and testing
Parameter tuning
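The notes leave this item empty. As a minimal sketch of one common approach, the block below runs a grid search with cross-validation over the same KNN parameters that appear in the model comparison. The iris dataset and the candidate values are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; replace with your own X and Y.
X, Y = load_iris(return_X_y=True)

# Candidate values to try for each parameter (illustrative choices).
param_grid = {
    "n_neighbors": [2, 5, 10],
    "weights": ["uniform", "distance"],
}

# Try every combination and score each with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, Y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the best combination found.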
Model performance evaluation and optimization
Accuracy
Precision and recall
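To make the three metrics concrete, here is a small worked example using sklearn.metrics. The labels are invented so the counts are easy to check by hand: 3 true positives, 2 false positives, 1 false negative, and 2 true negatives.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up binary labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# accuracy  = (TP + TN) / total = (3 + 2) / 8
accuracy = accuracy_score(y_true, y_pred)

# precision = TP / (TP + FP) = 3 / 5
precision = precision_score(y_true, y_pred)

# recall    = TP / (TP + FN) = 3 / 4
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)
```

Precision drops when the model raises false alarms, and recall drops when it misses real positives, so the two are usually read together.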
Using the model
Saving the model
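The notes do not say how the model is saved. One widely used option for scikit-learn estimators is joblib, sketched below under that assumption. The iris data, the KNN model, and the file name knn_model.joblib are all illustrative.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Train a small stand-in model.
X, Y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=2).fit(X, Y)

# Write the fitted model to disk (a temporary directory here).
path = os.path.join(tempfile.mkdtemp(), "knn_model.joblib")
joblib.dump(model, path)

# Load it back and confirm it predicts the same labels.
restored = joblib.load(path)
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Saved files are tied to the library versions that produced them, so it helps to record the scikit-learn version alongside the file. Only load files from sources you trust, since loading can execute arbitrary code.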