Frequent Itemset Mining and Association Rules
机器学习
Frequent Itemset
Qustion: Find sets of items appear together frequently in baskets.
Support for itemset I: Number of baskets containing I. Given a support threshold s, then sets of items appear in at least s baskets are called frequent itemsets.
Association Rules: {i1,…,ik}→j means "if a basket contains all of i1,…,ik then it is likely to contain j". Confidence of association rule is
conf(I→j)=support(I∪j)support(I)
Not all high-confidence rules are interesting:
Interest(I→j)=conf(I→j)−P[j]
Interesting rules are those with high postive or negative interest values.
A-Priori
A-Priori finds frequent itemsets.
Key idea: If a set of items I appears at least s times, so does every subset J of I. If item i does not appear in s baskets, then no pair including i can appear in s baskets.
- Read baskets and count in main memory the occurrences of each individual item.
- Read baskets again and count in main memory only those pairs where both elements are frequent (2-tuple).
- For each k, we construct two sets of k-tuples:
- Generate candidate k-tuple Ck from Lk−1 and L1.
- Lk= the set of truly frequent k-tuples.
Note that to generate Ck from Lk−1 and L1, one should be careful. For example, in C3 we know {b,m,j} cannot be frequent since {m,j} is not frequent.