Python Data Mining # 3. Classification [Experiment: Iris Plant Classification] (the OneR Algorithm)

Anonymous technical user, 2020-12-27 09:48

0 Experiment Environment

Python 3.7.0

  • scikit-learn

1 Preparing the Dataset

scikit-learn ships with this dataset, so installing the scikit-learn package is all that is needed.

2 Importing the Data

import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
x = dataset.data
y = dataset.target
print(dataset.DESCR)
n_samples, n_features = x.shape

Iris plants dataset

Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher’s paper. Note that it’s the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher’s paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References

  • Fisher, R.A. “The use of multiple measurements in taxonomic problems”
    Annual Eugenics, 7, Part II, 179-188 (1936); also in “Contributions to
    Mathematical Statistics” (John Wiley, NY, 1950).
  • Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
    (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
  • Dasarathy, B.V. (1980) “Nosing Around the Neighborhood: A New System
    Structure and Classification Rule for Recognition in Partially Exposed
    Environments”. IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. PAMI-2, No. 1, 67-71.
  • Gates, G.W. (1972) “The Reduced Nearest Neighbor Rule”. IEEE Transactions
    on Information Theory, May 1972, 431-433.
  • See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.'s AUTOCLASS II
    conceptual clustering system finds 3 classes in the data.
  • Many, many more …

150 rows of data, 4 feature columns, and 3 classes.


3 Processing the Data

Take a look at the data:

print(x[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

The task is to derive, from these four features, the characteristic feature values of each of the three classes.
The features are continuous; turning continuous values into categories is called discretization.
The simplest discretization scheme for a beginner is to pick a threshold: feature values below the threshold become 0, and values at or above it become 1.
Here the threshold for each feature is set to the mean of all of that feature's values.
Code to compute the means:

attribute_means = x.mean(axis=0)
assert attribute_means.shape == (n_features,)  # one mean per feature
x_d = np.array(x >= attribute_means, dtype='int')

This gives the discretized dataset x_d.
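To see what mean-thresholding does, here is a minimal self-contained sketch on a tiny hand-made array (the values below are assumptions for illustration, not the Iris data):

```python
import numpy as np

# Toy 2-feature data to illustrate mean-thresholding discretization
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
means = x.mean(axis=0)          # per-column means: [2.5, 25.0]
x_d = (x >= means).astype(int)  # 1 where the value is >= its column mean, else 0
print(means)
print(x_d)
```

Each column is thresholded against its own mean, so the first two rows (below both means) become [0 0] and the last two become [1 1].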

Next, split the dataset into training and test sets.

from sklearn.model_selection import train_test_split

random_state = 14
x_train, x_test, y_train, y_test = train_test_split(x_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape))
print("There are {} testing samples".format(y_test.shape))

Notice
random_state is the random seed. With the same random_state and the same dataset, the split is always the same (setting it to None gives a truly random split; to reproduce the results here, use the same value as above).
So the output should be:

There are (112,) training samples
There are (38,) testing samples
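The reproducibility claim can be checked directly; a minimal sketch on toy arrays (the shapes and values below are assumptions, not the Iris data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# Two calls with the same random_state yield identical splits
a_train, a_test, _, _ = train_test_split(data, labels, random_state=14)
b_train, b_test, _, _ = train_test_split(data, labels, random_state=14)
print(np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test))  # True
```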

4 Implementing the OneR Algorithm

The OneR algorithm classifies a sample according to the class that individuals with the same feature value most often belong to in the existing data.
OneR (short for "One Rule") uses only the one feature, out of the four, that classifies best.

The algorithm first iterates over every value of every feature. For each feature value, it counts how often that value appears in each class,
finds the class in which it appears most often, and records how often it appears in the other classes.

We then compute each feature's error count and choose the feature with the lowest error as the single classification rule (the "one rule") used for prediction.

from collections import defaultdict
from operator import itemgetter

def train(x, y_true, feature):
    n_samples, n_features = x.shape
    assert 0 <= feature < n_features
    values = set(x[:, feature])  # all values this feature takes
    predictors = dict()
    errors = []
    for current_value in values:  # iterate over the feature's values
        most_frequent_class, error = train_feature_value(x, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class  # map value -> predicted class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error

# Arguments: dataset, class array, feature index, feature value
def train_feature_value(x, y_true, feature, value):
    class_counts = defaultdict(int)
    for sample, y in zip(x, y_true):  # zip pairs each sample with its label
        if sample[feature] == value:
            class_counts[y] += 1
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    # Return the most frequent class for this feature value, and the error count
    return most_frequent_class, error
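A small sanity check of the per-value counting logic on hand-made data (the toy arrays below are assumptions, not the Iris split; the function is restated so the snippet runs on its own):

```python
from collections import defaultdict
from operator import itemgetter
import numpy as np

def train_feature_value(x, y_true, feature, value):
    class_counts = defaultdict(int)
    for sample, y in zip(x, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    error = sum(count for cls, count in class_counts.items()
                if cls != most_frequent_class)
    return most_frequent_class, error

# 4 samples, one binary feature; value 1 occurs with labels 0, 0, 1
x = np.array([[1], [1], [1], [0]])
y = np.array([0, 0, 1, 1])
print(train_feature_value(x, y, 0, 1))  # (0, 1): majority class 0, 1 misclassified
```

Value 1 appears twice with class 0 and once with class 1, so the rule predicts class 0 and counts one error.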

5 Testing the Algorithm

all_predictors = {variable: train(x_train, y_train, variable)
                  for variable in range(x_train.shape[1])}  # one model per feature
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
print(model)

The best model is based on variable 2 and has error 37.00
{'variable': 2, 'predictor': {0: 0, 1: 2}}

So feature 2 classifies best: predicting class 0 when its discretized value is 0 and class 2 when it is 1 gives the lowest error on the training set.
(On closer inspection, the intermediate errors dict adds little. The equivalent code below sorts the dict of (predictor, error) tuples directly, so errors can be dropped.)

all_predictors = {variable: train(x_train, y_train, variable)
                  for variable in range(x_train.shape[1])}  # one model per feature
# errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
best_variable, (mapping, best_error) = sorted(all_predictors.items(),
                                              key=lambda item: item[1][1])[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
print(model)

In practice we need to predict many samples at once, so wrap prediction in a function:

def predict(x_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in x_test])
    return y_predicted

Test it:

y_predicted = predict(x_test, model)
print(y_predicted)

Output:

[0 0 0 2 2 2 0 2 0 2 2 0 2 2 0 2 0 2 2 2 0 0 0 2 0 2 0 2 2 0 0 0 2 0 2 0 2 2]

Compute the accuracy:

accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))

The test accuracy is 65.8%
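The same number can also be obtained from scikit-learn's accuracy_score; a sketch on toy labels (the arrays below are assumptions for illustration, not the actual predictions above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([0, 2, 2, 0, 1])
y_predicted = np.array([0, 2, 0, 0, 2])

# Manual accuracy: fraction of matching labels, as a percentage
manual = np.mean(y_predicted == y_test) * 100
print(manual)                                   # 60.0
print(accuracy_score(y_test, y_predicted) * 100)
```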
