Python Data Mining # 4. Classifying with scikit-learn Estimators [Experiment: Ionosphere Classification] (Nearest Neighbors)

0 Experiment Environment

python 3.7.0

matplotlib
scikit-learn

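1 Background

Estimator: used for classification, clustering, and regression.
Transformer: used for data preprocessing and data transformation.
Pipeline: chains the steps of a data mining workflow so that it can be reused.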
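
As a rough illustration of how the three concepts fit together, the sketch below chains one transformer and one estimator into a pipeline. The toy data and the names X_toy, y_toy, and pipeline are assumptions made up for this example and are not part of the experiment; MinMaxScaler is just one possible transformer.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two classes, two features on very different scales.
rng = np.random.RandomState(14)
X_toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 1000.0], size=(50, 2)),
    rng.normal(loc=[3.0, 3000.0], scale=[1.0, 1000.0], size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),        # transformer: rescales each feature to [0, 1]
    ("knn", KNeighborsClassifier()),  # estimator: classifies the rescaled samples
])

pipeline.fit(X_toy, y_toy)             # fits the scaler, then the classifier
predictions = pipeline.predict(X_toy)  # reuses the fitted scaler before predicting
print(predictions.shape)
```

Because the pipeline exposes the same fit() and predict() methods as a single estimator, it can be passed wherever an estimator is expected.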

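Euclidean distance: the length of the line segment connecting two points (the square root of the sum of squared differences between their feature values).

 Performs poorly when some features take very large values (outliers).
 Also performs poorly on sparse matrices.

Manhattan distance: the sum of the absolute differences between two samples along each coordinate axis.
Cosine distance: derived from the cosine of the angle between the two feature vectors.

 Better suited to handling outliers and sparse data.

The choice of distance metric has a large effect on the final result.

In this experiment we use Euclidean distance.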
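
To make the three metrics concrete, here is a minimal NumPy sketch that computes each of them for two made-up vectors a and b. Cosine distance is taken here as one minus the cosine of the angle, which is a common convention and an assumption of this sketch.

```python
import numpy as np

# Two hypothetical feature vectors, chosen only for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared per-feature differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of absolute per-feature differences.
manhattan = np.sum(np.abs(a - b))

# Cosine: cosine of the angle between the vectors, turned into a distance
# as 1 - similarity (assumed convention).
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1.0 - cosine_similarity

print(euclidean)   # 5.0
print(manhattan)   # 7.0
print(cosine_distance)
```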

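2 Loading the Dataset

We use the Ionosphere dataset.
URL: http://archive.ics.uci.edu/ml/datasets/Ionosphere
Download ionosphere.data and ionosphere.names (I put the files in an ionosphere folder created inside the data folder under my home directory).
To display figures directly in Jupyter, we need one setting: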

%matplotlib inline

Get the home directory and build the path to the data file:

import os
home_folder = os.path.expanduser("~")
print(home_folder)

data_folder = os.path.join(home_folder,"data","ionosphere")
data_filename = os.path.join(data_folder,"ionosphere.data")
print(data_filename)

My test environment is Windows 10, so the output is:
C:\Users\skysys\data\ionosphere\ionosphere.data

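Next we load the dataset. Each row has 35 values: the first 34 are measurements collected by 17 antennas (two values per antenna), and the last is g or b, indicating whether the data is good or bad (whether it provides useful information).
Our task is to build a classifier that decides whether the data is good or bad.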

import numpy as np
import csv
x = np.zeros((351,34),dtype='float')
y = np.zeros((351,),dtype='bool')
with open(data_filename,'r') as input_file:
    reader = csv.reader(input_file)
    for i,row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        x[i] = data
        y[i] = row[-1] == 'g'
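
The parsing step can be tried without the real file. The sketch below runs the same logic on two made-up rows held in memory; the rows have only three features each and the values are hypothetical, so this only shows how the row layout is turned into a feature matrix and a boolean label array.

```python
import csv
import io

import numpy as np

# Hypothetical rows in the same layout as ionosphere.data:
# numeric features first, class letter (g or b) last.
sample_text = "1,0,0.5,g\n1,0,-0.25,b\n"

rows = list(csv.reader(io.StringIO(sample_text)))
features = np.array([[float(value) for value in row[:-1]] for row in rows])
labels = np.array([row[-1] == "g" for row in rows])  # True means "good"

print(features.shape)
print(labels)
```

Building the arrays from the parsed rows also avoids hard-coding the number of samples, at the cost of holding the rows in a list first.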

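3 Data Processing

scikit-learn implements a large number of classification algorithms and wraps each one as an estimator for classification tasks. An estimator mainly provides the following two methods:

fit(): trains the algorithm and sets its internal parameters. Takes the training set and its class labels.
predict(): takes the test set and returns the predicted classes.

This article uses the nearest neighbor algorithm.

Split the dataset: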

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=14)
print("There are {} samples in the training dataset".format(x_train.shape[0]))
print("There are {} samples in the testing dataset".format(x_test.shape[0]))
print("Each sample has {} features".format(x_train.shape[1]))

There are 263 samples in the training dataset
There are 88 samples in the testing dataset
Each sample has 34 features

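The idea behind the nearest neighbor algorithm: to classify a new sample, search the training set for the samples most similar to it; the class that most of those samples belong to becomes the class of the new sample.
The algorithm performs poorly when features take discrete (categorical) values.
To implement it, the key is measuring how similar two samples are; the distances introduced at the beginning of this article are all common choices.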
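
The idea can be sketched in a few lines of NumPy. This is a simplified illustration with Euclidean distance and a plain majority vote, not how scikit-learn implements it; the function name knn_predict and the toy arrays are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, new_x, k=3):
    """Classify one sample by majority vote among its k nearest training samples."""
    # Euclidean distance from the new sample to every training sample.
    distances = np.sqrt(np.sum((train_x - new_x) ** 2, axis=1))
    # Indices of the k closest training samples.
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors wins.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([False, False, False, True, True, True])

print(knn_predict(toy_x, toy_y, np.array([0.3, 0.3])))  # False
print(knn_predict(toy_x, toy_y, np.array([4.8, 5.0])))  # True
```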

Next we implement it with scikit-learn.
Import the k-nearest neighbors estimator:

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()

Start training:

estimator.fit(x_train,y_train)

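Output: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')

With metric='minkowski' and p=2, the default estimator uses Euclidean distance, which matches the choice made in the background section.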

4 Evaluating the Algorithm

Evaluate the algorithm on the test set:

y_predicted = estimator.predict(x_test)
accuracy = np.mean(y_test == y_predicted)*100
print("The accuracy is {0:.1f}%".format(accuracy))

The accuracy is 86.4%
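
The manual accuracy calculation above should agree with scikit-learn's own helpers. The sketch below checks this on synthetic data generated with make_classification, so it runs without the Ionosphere file; all variable names with the demo_ prefix are assumptions for this example, and the numbers it produces say nothing about the Ionosphere results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the Ionosphere dataset.
demo_x, demo_y = make_classification(n_samples=200, n_features=10, random_state=14)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, random_state=14)

demo_estimator = KNeighborsClassifier().fit(demo_x_train, demo_y_train)
demo_predicted = demo_estimator.predict(demo_x_test)

# Three ways to get the fraction of correctly classified test samples.
manual = np.mean(demo_y_test == demo_predicted)
via_metric = accuracy_score(demo_y_test, demo_predicted)
via_score = demo_estimator.score(demo_x_test, demo_y_test)

print(manual, via_metric, via_score)
```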

5 Cross-Validation

from sklearn.model_selection import cross_val_score

avg_scores = []
all_scores = []
parameter_values = list(range(1,21))
# Evaluate each value of n_neighbors from 1 to 20 with cross-validation
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator,x,y,scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)

This may raise a FutureWarning. If that bothers you, see an earlier blog post of mine on silencing FutureWarning: https://blog.csdn.net/qq_33583069/article/details/89387196

Plot the results:

from matplotlib import pyplot as plt
plt.figure(figsize=(32,20))
plt.plot(parameter_values,avg_scores,'-o',linewidth=5,markersize=24)

for parameter,scores in zip(parameter_values,all_scores):
    n_scores = len(scores)
    plt.plot([parameter] * n_scores,scores,'-o')

plt.plot(parameter_values,all_scores,'bx')

from collections import defaultdict
all_scores = defaultdict(list)
parameter_values = list(range(1, 21))  # 1 through 20 inclusive
for n_neighbors in parameter_values:
    # Note: with an integer cv the folds are not shuffled,
    # so each of the 100 repetitions yields the same scores.
    for i in range(100):
        estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
        scores = cross_val_score(estimator, x, y, scoring='accuracy', cv=10)
        all_scores[n_neighbors].append(scores)
for parameter in parameter_values:
    scores = all_scores[parameter]
    n_scores = len(scores)
    plt.plot([parameter] * n_scores, scores, '-o')
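
When cross_val_score is called repeatedly with an integer cv, the folds are built the same way each time, so repeating the call does not add new information for a deterministic classifier such as k-nearest neighbors. One possible way to get genuinely different splits per repetition is RepeatedStratifiedKFold. The sketch below is an assumption about how that could look, run on synthetic data rather than the Ionosphere file; the names demo_x, demo_y, repeated_cv, and mean_by_k are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the Ionosphere dataset.
demo_x, demo_y = make_classification(n_samples=200, n_features=10, random_state=14)

# 10 folds, reshuffled 5 times, giving 50 scores per parameter value.
repeated_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=14)

mean_by_k = {}
for k in (1, 5, 9):
    estimator = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator, demo_x, demo_y,
                             scoring="accuracy", cv=repeated_cv)
    mean_by_k[k] = scores.mean()
    print(k, len(scores), round(scores.mean(), 3))
```

The spread of the 50 scores for each k then reflects how much the result depends on the particular split, which is what the scatter plot above is meant to show.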