Python for Data Science Chapter 6 - Other Popular Machine Learning Methods Import the required libraries Data Format Data Coversion Support Calculation Association Rules
Segment 1 - Association Rule Mining Using Apriori Algorithm
Association Rule Mining
Association rule mining is a process that deploys pattern recognition to identify and quantify relationships between different, yet related items.
A Simple Association Rules Use Case
- Popular use case: product placement optimization at both brick and mortar and ecommerce stores.
Advantages of Association Rules
Fast
Works with small data
Feature Engineering
The term feature engineering refers to the process of engineering data into a predictive feature that fits the requirements (and/or improves the performance) of a machine learning model.
Three Ways to Measure Association
- Support:
Support is the relative frequency of an item within a dataset. Support for an item can be calculated as:
- Confidence:
Confidence is the probability of seeing the consequent item (a "then" term) within data, given that the data also contains the antecedent(the "if" term) item.
In other words, confidence tells you:
THEN How likely it is for 1 item to be purchase given that,
IF another item is purchased.
Confidence determines how many if-then statements are found to be true within a dataset.
- Lift
Lift is a metric that measures how much more often the antecedent and consequent occur together rather than them occurring independently.
Lift Scores
- Lift score > 1: A is highly associated with C. If A is purchased, it is likely that C will also be purchased
- Lift score < 1: If A is purchased, it is unlikely that C will be purchased
- Lift score = 1: Indicates that there is no association between items A and C
Where Apriori Fits In
The Apriori algorithm is the algorithm that you use to implement association rule mining over structured data.
Import the required libraries
pip install mlxtend
Defaulting to user installation because normal site-packages is not writeable
Collecting mlxtend
Downloading mlxtend-0.18.0-py2.py3-none-any.whl (1.3 MB)
[K |████████████████████████████████| 1.3 MB 1.1 MB/s eta 0:00:01
[?25hRequirement already satisfied: scipy>=1.2.1 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.5.4)
Requirement already satisfied: pandas>=0.24.2 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.1.5)
Requirement already satisfied: setuptools in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (51.1.0.post20201221)
Requirement already satisfied: matplotlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (3.1.1)
Requirement already satisfied: joblib>=0.13.2 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.0.0)
Requirement already satisfied: numpy>=1.16.2 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.19.4)
Requirement already satisfied: scikit-learn>=0.20.3 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (0.24.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (2.8.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (2.4.2)
Requirement already satisfied: six in /home/ericwei/.local/lib/python3.7/site-packages (from cycler>=0.10->matplotlib>=3.0.0->mlxtend) (1.15.0)
Requirement already satisfied: pytz>=2017.2 in /home/ericwei/.local/lib/python3.7/site-packages (from pandas>=0.24.2->mlxtend) (2020.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/ericwei/.local/lib/python3.7/site-packages (from scikit-learn>=0.20.3->mlxtend) (2.1.0)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.18.0
[33mWARNING: You are using pip version 20.3.3; however, version 21.0 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Data Format
address = '~/Data/groceries.csv'
data = pd.read_csv(address)
data.head()
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|
0 | citrus fruit | semi-finished bread | margarine | ready soups | NaN | NaN | NaN | NaN | NaN |
1 | tropical fruit | yogurt | coffee | NaN | NaN | NaN | NaN | NaN | NaN |
2 | whole milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | pip fruit | yogurt | cream cheese | meat spreads | NaN | NaN | NaN | NaN | NaN |
4 | other vegetables | whole milk | condensed milk | long life bakery product | NaN | NaN | NaN | NaN | NaN |
Data Coversion
basket_sets = pd.get_dummies(data)
basket_sets.head()
1_Instant food products | 1_UHT-milk | 1_artif. sweetener | 1_baby cosmetics | 1_bags | 1_baking powder | 1_bathroom cleaner | 1_beef | 1_berries | 1_beverages | ... | 9_sweet spreads | 9_tea | 9_vinegar | 9_waffles | 9_whipped/sour cream | 9_white bread | 9_white wine | 9_whole milk | 9_yogurt | 9_zwieback | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 1113 columns
Support Calculation
apriori(basket_sets, min_support=0.02)
support | itemsets | |
---|---|---|
0 | 0.030421 | (7) |
1 | 0.034951 | (17) |
2 | 0.029126 | (23) |
3 | 0.049191 | (26) |
4 | 0.064401 | (47) |
5 | 0.044660 | (83) |
6 | 0.024272 | (90) |
7 | 0.040453 | (92) |
8 | 0.038835 | (99) |
9 | 0.033981 | (100) |
10 | 0.076052 | (105) |
11 | 0.028803 | (111) |
12 | 0.044984 | (123) |
13 | 0.073463 | (130) |
14 | 0.022977 | (131) |
15 | 0.028803 | (159) |
16 | 0.058900 | (217) |
17 | 0.022977 | (224) |
18 | 0.040129 | (232) |
19 | 0.036893 | (233) |
20 | 0.031068 | (243) |
21 | 0.034628 | (256) |
22 | 0.062136 | (263) |
23 | 0.028479 | (264) |
24 | 0.045955 | (351) |
25 | 0.033010 | (366) |
26 | 0.024272 | (378) |
27 | 0.057929 | (397) |
28 | 0.023301 | (398) |
29 | 0.020712 | (479) |
30 | 0.024595 | (497) |
31 | 0.024272 | (510) |
32 | 0.033333 | (531) |
33 | 0.023301 | (532) |
34 | 0.020065 | (631) |
35 | 0.021036 | (217, 397) |
apriori(basket_sets, min_support=0.02, use_colnames=True)
support | itemsets | |
---|---|---|
0 | 0.030421 | (1_beef) |
1 | 0.034951 | (1_canned beer) |
2 | 0.029126 | (1_chicken) |
3 | 0.049191 | (1_citrus fruit) |
4 | 0.064401 | (1_frankfurter) |
5 | 0.044660 | (1_other vegetables) |
6 | 0.024272 | (1_pip fruit) |
7 | 0.040453 | (1_pork) |
8 | 0.038835 | (1_rolls/buns) |
9 | 0.033981 | (1_root vegetables) |
10 | 0.076052 | (1_sausage) |
11 | 0.028803 | (1_soda) |
12 | 0.044984 | (1_tropical fruit) |
13 | 0.073463 | (1_whole milk) |
14 | 0.022977 | (1_yogurt) |
15 | 0.028803 | (2_citrus fruit) |
16 | 0.058900 | (2_other vegetables) |
17 | 0.022977 | (2_pip fruit) |
18 | 0.040129 | (2_rolls/buns) |
19 | 0.036893 | (2_root vegetables) |
20 | 0.031068 | (2_soda) |
21 | 0.034628 | (2_tropical fruit) |
22 | 0.062136 | (2_whole milk) |
23 | 0.028479 | (2_yogurt) |
24 | 0.045955 | (3_other vegetables) |
25 | 0.033010 | (3_rolls/buns) |
26 | 0.024272 | (3_soda) |
27 | 0.057929 | (3_whole milk) |
28 | 0.023301 | (3_yogurt) |
29 | 0.020712 | (4_other vegetables) |
30 | 0.024595 | (4_rolls/buns) |
31 | 0.024272 | (4_soda) |
32 | 0.033333 | (4_whole milk) |
33 | 0.023301 | (4_yogurt) |
34 | 0.020065 | (5_rolls/buns) |
35 | 0.021036 | (3_whole milk, 2_other vegetables) |
df = basket_sets
frequent_itemsets = apriori(basket_sets, min_support=0.002, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
support | itemsets | length | |
---|---|---|---|
0 | 0.006472 | (1_UHT-milk) | 1 |
1 | 0.030421 | (1_beef) | 1 |
2 | 0.011974 | (1_berries) | 1 |
3 | 0.008414 | (1_beverages) | 1 |
4 | 0.014887 | (1_bottled beer) | 1 |
... | ... | ... | ... |
844 | 0.002265 | (6_whole milk, 3_pip fruit, 5_other vegetables) | 3 |
845 | 0.002589 | (4_other vegetables, 5_whole milk, 3_root vege... | 3 |
846 | 0.002913 | (5_yogurt, 4_curd, 3_whole milk) | 3 |
847 | 0.003236 | (6_whole milk, 4_root vegetables, 5_other vege... | 3 |
848 | 0.002265 | (6_whole milk, 7_butter, 5_other vegetables) | 3 |
849 rows × 3 columns
frequent_itemsets[frequent_itemsets['length'] >= 3]
support | itemsets | length | |
---|---|---|---|
820 | 0.002589 | (2_root vegetables, 1_beef, 3_other vegetables) | 3 |
821 | 0.002589 | (3_whole milk, 2_other vegetables, 1_chicken) | 3 |
822 | 0.002589 | (1_citrus fruit, 3_whole milk, 2_other vegetab... | 3 |
823 | 0.003236 | (1_citrus fruit, 2_tropical fruit, 3_pip fruit) | 3 |
824 | 0.002589 | (1_citrus fruit, 4_whole milk, 3_other vegetab... | 3 |
825 | 0.002265 | (1_frankfurter, 6_whole milk, 5_other vegetables) | 3 |
826 | 0.002265 | (3_other vegetables, 4_whole milk, 1_pork) | 3 |
827 | 0.003560 | (1_root vegetables, 3_whole milk, 2_other vege... | 3 |
828 | 0.002589 | (1_sausage, 2_rolls/buns, 3_soda) | 3 |
829 | 0.002265 | (3_other vegetables, 1_sausage, 4_whole milk) | 3 |
830 | 0.002265 | (1_sausage, 4_other vegetables, 5_whole milk) | 3 |
831 | 0.002913 | (1_tropical fruit, 3_whole milk, 2_other veget... | 3 |
832 | 0.002265 | (4_other vegetables, 5_whole milk, 2_citrus fr... | 3 |
833 | 0.002265 | (4_butter, 2_other vegetables, 3_whole milk) | 3 |
834 | 0.003560 | (4_curd, 3_whole milk, 2_other vegetables) | 3 |
835 | 0.003883 | (4_yogurt, 3_whole milk, 2_other vegetables) | 3 |
836 | 0.002265 | (3_whole milk, 2_other vegetables, 6_rolls/buns) | 3 |
837 | 0.003236 | (3_other vegetables, 4_whole milk, 2_pip fruit) | 3 |
838 | 0.005825 | (2_root vegetables, 4_whole milk, 3_other vege... | 3 |
839 | 0.002265 | (4_other vegetables, 2_tropical fruit, 3_pip f... | 3 |
840 | 0.003560 | (5_butter, 4_whole milk, 3_other vegetables) | 3 |
841 | 0.002913 | (3_other vegetables, 4_whole milk, 5_yogurt) | 3 |
842 | 0.003560 | (3_other vegetables, 6_yogurt, 4_whole milk) | 3 |
843 | 0.002265 | (3_pip fruit, 4_root vegetables, 5_other veget... | 3 |
844 | 0.002265 | (6_whole milk, 3_pip fruit, 5_other vegetables) | 3 |
845 | 0.002589 | (4_other vegetables, 5_whole milk, 3_root vege... | 3 |
846 | 0.002913 | (5_yogurt, 4_curd, 3_whole milk) | 3 |
847 | 0.003236 | (6_whole milk, 4_root vegetables, 5_other vege... | 3 |
848 | 0.002265 | (6_whole milk, 7_butter, 5_other vegetables) | 3 |
Association Rules
Confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
---|---|---|---|---|---|---|---|---|---|
0 | (2_sausage) | (1_frankfurter) | 0.011327 | 0.064401 | 0.011327 | 1.000000 | 15.527638 | 0.010597 | inf |
1 | (7_pastry) | (1_frankfurter) | 0.005178 | 0.064401 | 0.002589 | 0.500000 | 7.763819 | 0.002256 | 1.871197 |
2 | (2_ham) | (1_sausage) | 0.007120 | 0.076052 | 0.004531 | 0.636364 | 8.367505 | 0.003989 | 2.540858 |
3 | (2_meat) | (1_sausage) | 0.006796 | 0.076052 | 0.004854 | 0.714286 | 9.392097 | 0.004338 | 3.233819 |
4 | (3_beef) | (1_sausage) | 0.004854 | 0.076052 | 0.002589 | 0.533333 | 7.012766 | 0.002220 | 1.979889 |
Lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.5)
rules.head()
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
---|---|---|---|---|---|---|---|---|---|
0 | (1_beef) | (2_citrus fruit) | 0.030421 | 0.028803 | 0.005502 | 0.180851 | 6.278986 | 0.004625 | 1.185618 |
1 | (2_citrus fruit) | (1_beef) | 0.028803 | 0.030421 | 0.005502 | 0.191011 | 6.278986 | 0.004625 | 1.198508 |
2 | (1_beef) | (2_other vegetables) | 0.030421 | 0.058900 | 0.003236 | 0.106383 | 1.806173 | 0.001444 | 1.053136 |
3 | (2_other vegetables) | (1_beef) | 0.058900 | 0.030421 | 0.003236 | 0.054945 | 1.806173 | 0.001444 | 1.025950 |
4 | (2_root vegetables) | (1_beef) | 0.036893 | 0.030421 | 0.005502 | 0.149123 | 4.902016 | 0.004379 | 1.139506 |
Lift and Confidence
rules[(rules['lift'] >= 5) & (rules['confidence'] >=0.5)]
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
---|---|---|---|---|---|---|---|---|---|
94 | (2_sausage) | (1_frankfurter) | 0.011327 | 0.064401 | 0.011327 | 1.000000 | 15.527638 | 0.010597 | inf |
141 | (7_pastry) | (1_frankfurter) | 0.005178 | 0.064401 | 0.002589 | 0.500000 | 7.763819 | 0.002256 | 1.871197 |
243 | (2_ham) | (1_sausage) | 0.007120 | 0.076052 | 0.004531 | 0.636364 | 8.367505 | 0.003989 | 2.540858 |
247 | (2_meat) | (1_sausage) | 0.006796 | 0.076052 | 0.004854 | 0.714286 | 9.392097 | 0.004338 | 3.233819 |
262 | (3_beef) | (1_sausage) | 0.004854 | 0.076052 | 0.002589 | 0.533333 | 7.012766 | 0.002220 | 1.979889 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
962 | (6_whole milk, 4_root vegetables) | (5_other vegetables) | 0.003883 | 0.012621 | 0.003236 | 0.833333 | 66.025641 | 0.003187 | 5.924272 |
964 | (4_root vegetables, 5_other vegetables) | (6_whole milk) | 0.005178 | 0.009385 | 0.003236 | 0.625000 | 66.594828 | 0.003188 | 2.641640 |
968 | (6_whole milk, 7_butter) | (5_other vegetables) | 0.002913 | 0.012621 | 0.002265 | 0.777778 | 61.623932 | 0.002229 | 4.443204 |
970 | (7_butter, 5_other vegetables) | (6_whole milk) | 0.002589 | 0.009385 | 0.002265 | 0.875000 | 93.232759 | 0.002241 | 7.924919 |
972 | (7_butter) | (6_whole milk, 5_other vegetables) | 0.004207 | 0.007443 | 0.002265 | 0.538462 | 72.341137 | 0.002234 | 2.150539 |
76 rows × 9 columns