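Python Data Mining # 2. Affinity Analysis [Experiment: Product Recommendation]

0 Prerequisites

Basic Python programming and basic operations with the NumPy module.

1 Definition

Affinity analysis determines how closely related samples are, based on the similarity between them.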

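2 Applications

Offering website users diversified services or serving them targeted advertising
Recommending movies or products to users so as to sell them related items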

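3 Example

We analyze the affinity between products by computing how purchases of different products correlate. For example, if a user buys bananas after buying apples, then for that user the two samples "apples" and "bananas" have a certain affinity. Of course, this is only a simple assumption; it is not backed by large-scale statistics or analysis.

Rule: "If a customer buys product X, they are likely to buy product Y"
Support: the number of times the rule holds in the dataset; sometimes the support needs to be normalized
Confidence: the accuracy of the rule, i.e. among all samples that satisfy the premise, the fraction for which the conclusion also holds

* Rules involving several products get complicated, so this example is simplified.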
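To make the two measures concrete, here is a minimal sketch on a toy list of transactions. The data and the variable names are made up for illustration and are not part of this article's dataset.

```python
# Toy transactions (hypothetical data): each tuple is
# (bought_X, bought_Y) for one customer.
transactions = [
    (1, 1),
    (1, 0),
    (1, 1),
    (0, 1),
    (0, 0),
]

# Premise: the customer bought X.
premise_count = sum(1 for x, y in transactions if x == 1)

# Support: number of customers for whom the rule "X -> Y" holds.
support = sum(1 for x, y in transactions if x == 1 and y == 1)

# Normalized support: fraction of all transactions where the rule holds.
normalized_support = support / len(transactions)

# Confidence: of the customers who bought X, the fraction who also bought Y.
confidence = support / premise_count

print(support, normalized_support, round(confidence, 3))  # 2 0.4 0.667
```

Here three customers bought X and two of them also bought Y, so the support is 2 and the confidence is 2/3.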

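4 Generating the Data

Suppose there are 5 products: ["bread", "milk", "cheese", "apples", "bananas"]. We first generate a matrix in which each row is one customer and each column is one product. Later, dictionaries keyed by pairs of product indices record how two products are associated; for example, the entry for (1, 3) counts the customers who bought milk and also bought apples.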

import numpy as np
X = np.zeros((100, 5), dtype='bool')
features = ["bread", "milk", "cheese", "apples", "bananas"]
for i in range(X.shape[0]):
    if np.random.random() < 0.3:
        # A bread winner
        X[i][0] = 1
        if np.random.random() < 0.5:
            # Who likes milk
            X[i][1] = 1
        if np.random.random() < 0.2:
            # Who likes cheese
            X[i][2] = 1
        if np.random.random() < 0.25:
            # Who likes apples
            X[i][3] = 1
        if np.random.random() < 0.5:
            # Who likes bananas
            X[i][4] = 1
    else:
        # Not a bread winner
        X[i][0] = 0
        if np.random.random() < 0.5:
            # Who likes milk
            X[i][1] = 1
            if np.random.random() < 0.2:
                # Who likes cheese
                X[i][2] = 1
            if np.random.random() < 0.25:
                # Who likes apples
                X[i][3] = 1
            if np.random.random() < 0.5:
                # Who likes bananas
                X[i][4] = 1
        else:
            if np.random.random() < 0.8:
                # Who likes cheese
                X[i][2] = 1
            if np.random.random() < 0.6:
                # Who likes apples
                X[i][3] = 1
            if np.random.random() < 0.7:
                # Who likes bananas
                X[i][4] = 1
    if X[i].sum() == 0:
        X[i][4] = 1  # Must buy something, so gets bananas

print(X[:5])

np.savetxt("affinity_dataset.txt", X, fmt="%d")
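Because the data is random, every run produces a different dataset. If repeatable results are wanted, one option is to draw from a seeded generator. The sketch below is an assumption-laden variant, not the code used to produce this article's numbers: the function name `generate_dataset` and the `seed` parameter are hypothetical, and it follows the same purchase probabilities as the loop above without reproducing the same rows.

```python
import numpy as np

def generate_dataset(n_samples=100, seed=14):
    """Hypothetical helper: random purchase matrix, repeatable via `seed`."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_samples, 5), dtype=bool)
    for i in range(n_samples):
        buys_bread = rng.random() < 0.3
        # Milk has probability 0.5 whether or not bread was bought.
        buys_milk = rng.random() < 0.5
        if buys_bread or buys_milk:
            p_cheese, p_apples, p_bananas = 0.2, 0.25, 0.5
        else:
            p_cheese, p_apples, p_bananas = 0.8, 0.6, 0.7
        X[i] = [
            buys_bread,
            buys_milk,
            rng.random() < p_cheese,
            rng.random() < p_apples,
            rng.random() < p_bananas,
        ]
        if not X[i].any():
            X[i, 4] = True  # must buy something, so gets bananas
    return X

X_repeatable = generate_dataset()
print(X_repeatable[:5].astype(int))
```

Calling `generate_dataset` twice with the same seed returns identical matrices, which makes the later counts comparable between runs.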

5 Processing the Data

import numpy as np
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples,n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples,n_features))

Output

This dataset has 100 samples and 5 features

print(X[:5])

Output

[[0. 0. 1. 1. 1.]
 [1. 1. 0. 1. 0.]
 [1. 0. 1. 1. 0.]
 [0. 0. 1. 1. 1.]
 [0. 1. 0. 0. 1.]]

Each row holds one customer's purchases: 1 means the product was bought, 0 means it was not.
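As a small illustration of how to read a row, the sketch below maps the first row of the output above to product names. The variable names `row` and `bought` are introduced only for this example.

```python
features = ["bread", "milk", "cheese", "apples", "bananas"]
row = [0, 0, 1, 1, 1]  # first row of the printed sample

# Keep the names of the products whose flag is 1.
bought = [name for name, flag in zip(features, row) if flag == 1]
print(bought)  # ['cheese', 'apples', 'bananas']
```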

features = ['bread','milk','cheese','apples','bananas']
num_apple_purchases = 0
for sample in X:
    if sample[3]==1:
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))

Output

36 people bought Apples
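The same count can also be obtained without an explicit loop. The following is a sketch using NumPy column indexing on a small stand-in matrix (the five rows printed earlier), so the numbers differ from the full dataset; `X_small` and `APPLES` are names assumed for this example.

```python
import numpy as np

# Stand-in data: the five sample rows shown earlier in this section.
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=int)

APPLES = 3  # column index of "apples"

# Count the rows whose apples column is 1.
num_apple_purchases_small = int((X_small[:, APPLES] == 1).sum())
print(num_apple_purchases_small)  # 4

# Summing over rows gives the purchase count of every product at once.
purchases_per_product = X_small.sum(axis=0)
print(purchases_per_product)  # [2 2 3 4 3]
```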

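Let us guess that the following rule exists: people who buy apples are likely to buy bananas as well.
Next we verify this guess. The results depend on the dataset you generated; the test data used in this article is attached at the end, and if you use it your results should match the ones shown here.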

rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:
        if sample[4] == 1:
            rule_valid += 1
        else:
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))

Output

21 cases of the rule being valid were discovered
15 cases of the rule being invalid were discovered

Next, compute the support and confidence of the rule:

support = rule_valid 
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}".format(support,confidence))
print("As a percentage,that is {0:.1f}%".format(100*confidence))

Output

The support is 21 and the confidence is 0.583
As a percentage,that is 58.3%
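The check for a single rule can be wrapped in a function that works for any pair of products. This is a sketch under assumptions: `rule_stats` is a hypothetical helper, it uses boolean masks instead of the loop above, and it is run on the five sample rows rather than the full dataset. It also returns a normalized support (support divided by the number of samples), which for the full dataset above would be 21 / 100 = 0.21.

```python
import numpy as np

def rule_stats(X, premise, conclusion):
    """Hypothetical helper: support, normalized support and confidence
    of the rule premise -> conclusion, given a 0/1 purchase matrix."""
    bought_premise = X[:, premise] == 1
    bought_both = bought_premise & (X[:, conclusion] == 1)
    support = int(bought_both.sum())
    n_premise = int(bought_premise.sum())
    normalized_support = support / X.shape[0]
    # Guard against a premise that never occurs.
    confidence = support / n_premise if n_premise else 0.0
    return support, normalized_support, confidence

# Stand-in data: the five sample rows shown earlier in this section.
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# Rule: apples (column 3) -> bananas (column 4)
print(rule_stats(X_small, 3, 4))  # (2, 0.4, 0.5)
```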

Now we mine all possible rules, assuming a single rule form: people who buy A will also buy B (A -> B).

from collections import defaultdict
valid_rules = defaultdict(int)     # premise and conclusion were both bought
invalid_rules = defaultdict(int)   # premise was bought, conclusion was not
num_occurences = defaultdict(int)  # how often each premise was bought
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]

Print the rules:

for premise,conclusion in confidence.keys():
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name,conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise,conclusion)]))
    print(" - Support: {0}".format(support[(premise,conclusion)]))
    print("")

Output:

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21

Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21

Rule: If a person buys bread they will also buy milk
 - Confidence: 0.519
 - Support: 14

Rule: If a person buys bread they will also buy apples
 - Confidence: 0.185
 - Support: 5

Rule: If a person buys milk they will also buy bread
 - Confidence: 0.304
 - Support: 14

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9

Rule: If a person buys apples they will also buy bread
 - Confidence: 0.139
 - Support: 5

Rule: If a person buys apples they will also buy milk
 - Confidence: 0.250
 - Support: 9

Rule: If a person buys bread they will also buy cheese
 - Confidence: 0.148
 - Support: 4

Rule: If a person buys cheese they will also buy bread
 - Confidence: 0.098
 - Support: 4

Rule: If a person buys milk they will also buy bananas
 - Confidence: 0.413
 - Support: 19

Rule: If a person buys bananas they will also buy milk
 - Confidence: 0.322
 - Support: 19

Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.630
 - Support: 17

Rule: If a person buys bananas they will also buy bread
 - Confidence: 0.288
 - Support: 17

Rule: If a person buys milk they will also buy cheese
 - Confidence: 0.152
 - Support: 7

Rule: If a person buys cheese they will also buy milk
 - Confidence: 0.171
 - Support: 7

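There should be 20 rules above (5 premises times 4 conclusions, i.e. all possible rules).
Next we process the computed rules further.
First, move the printing code above into a function of its own: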

def print_rule(premise,conclusion,support,confidence,features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Test the function:

premise = 1
conclusion = 3
print_rule(premise,conclusion,support,confidence,features)

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9

Inspect support:

from pprint import pprint
pprint(list(support.items()))

[((2, 3), 25),
((2, 4), 27),
((3, 2), 25),
((3, 4), 21),
((4, 2), 27),
((4, 3), 21),
((0, 1), 14),
((0, 3), 5),
((1, 0), 14),
((1, 3), 9),
((3, 0), 5),
((3, 1), 9),
((0, 2), 4),
((2, 0), 4),
((1, 4), 19),
((4, 1), 19),
((0, 4), 17),
((4, 0), 17),
((1, 2), 7),
((2, 1), 7)]
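Note that the support values are symmetric: (2, 3) and (3, 2) both have support 25, because both count the customers who bought the two products together. One way to see this is to compute all pairwise counts at once as a matrix product. The sketch below is an alternative formulation, not the method used above; it runs on the five sample rows shown earlier, and it assumes every product was bought at least once so that the division is defined.

```python
import numpy as np

# Stand-in data: the five sample rows shown earlier in this section.
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# co_occurrence[a, b] = number of customers who bought both a and b.
co_occurrence = X_small.T @ X_small

# The diagonal holds how often each product was bought on its own.
occurrences = np.diag(co_occurrence)

# confidence_matrix[a, b] = support(a -> b) / occurrences of premise a.
# Diagonal entries (premise == conclusion) are always 1 and should be ignored.
confidence_matrix = co_occurrence / occurrences[:, None]

print(co_occurrence[3, 4], co_occurrence[4, 3])  # 2 2
print(confidence_matrix[3, 4])  # 0.5
```

The support matrix is symmetric, while the confidence matrix is not, since each row is divided by a different premise count.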

All rules have now been computed, so let us sort them:

from operator import itemgetter
sorted_support = sorted(support.items(),key=itemgetter(1),reverse=True)

View the top 5 rules, ranked by support:

for index in range(5):
    print("Rule #{0}".format(index+1))
    (premise,conclusion) = sorted_support[index][0]
    print_rule(premise,conclusion,support,confidence,features)

Rule #1
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule #2
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule #3
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule #4
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule #5
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21
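The ranking above orders rules by support. Ranking by confidence works the same way, only on the confidence dictionary. The sketch below uses a handful of confidence values copied from the full rule listing earlier in this section; `confidence_sample` and `sorted_confidence` are names assumed for this example.

```python
from operator import itemgetter

features = ["bread", "milk", "cheese", "apples", "bananas"]

# A few (premise, conclusion) -> confidence pairs taken from the rule listing.
confidence_sample = {
    (2, 3): 0.610,
    (2, 4): 0.659,
    (3, 2): 0.694,
    (3, 4): 0.583,
    (4, 2): 0.458,
    (0, 4): 0.630,
}

# Sort by the dictionary value (the confidence), highest first.
sorted_confidence = sorted(confidence_sample.items(), key=itemgetter(1), reverse=True)

for (premise, conclusion), value in sorted_confidence[:3]:
    print("{0} -> {1}: {2:.3f}".format(features[premise], features[conclusion], value))
# apples -> cheese: 0.694
# cheese -> bananas: 0.659
# bread -> bananas: 0.630
```

The two rankings differ: bread -> bananas has a support of only 17 but the third-highest confidence among these rules.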

Appendix

affinity_dataset.txt