## Itemset

**Itemset**: A set of one or more items

**k-itemset**: An itemset containing k items

**support:**

- absolute support: sup{x} occurrences of an itemset X
- relative support: s{x} probability that a transaction contains X

**frequent:** An itemset (or a pattern) X is frequent if the support of X is no less than a `minsup`

threshold σ

{Beer, Diaper}: 3/5 (60%) Beer: 3/5 (60%), if σ = 0.3, they are frequent

**confident:** The conditional probability that a transaction containing X also contains Y

## Compress Pattern

find close pattern instead of all pattern

**close pattern:** pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y כ X, with the same support as X

T1: {a1, ..., a50}; T2: {a1, ..., a100}, close pattern: P1: “{a1, ..., a50}: 2”; P2: “{a1, ..., a100}: 1”

- Closed pattern is a
*lossless compression*of frequent patterns

**max pattern:** pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X

T1: {a1, ..., a50}; T2: {a1, ..., a100}, max pattern: P: “{a1, ..., a100}: 1”

- Max-pattern is a
*lossy compression*, that is it will cause data loss - in many application, mining closed-patterns is more desirable than mining max-patterns

## Mining Methods

### Apriori Algorithm

- BFS algorithm to find itemset frequent pattern

**downward closure**: Any subset of a frequent itemset must be frequent, If ANY subset of an itemset S is infrequent, then there is no chance for S to be frequent (use for pruning).

**candidate generation:**

- joining: the set A, B can only be joined if their first k-1 items are the same
- pruning: use downward closure

**partitioning:**

partitioning is a way to reduce passes of transaction database scans

- Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
- $min_{support}$ in partitioned zone = $min_{support}$ $\times$ events number in this partition zone
- only need to scan the DB twice to find frequent item

### FP-Growth

- DFS algorithm to find itemset frequent pattern, can use to deal with huge amount of data

**process:**

- scan DB, find single item frequent pattern
- sort frequent item in frequent descending
scan DB again, find the ordered frequent itemlist for each transaction

- FP-growth is faster than Apriori for most of situations
- both FP and Apriori are mining from horizontal data format (TID: data)

## Pattern Evaluate

**lift:** Measure of dependent/correlated events

$$ lift=\frac{s(B\cup C)}{s(B)\times s(C)} $$

- $lift(B, C) = 1$: B and C are independent
- $> 1$: positively correlated
- $< 1$: negatively correlated

**$\chi^2$**: another way to measure test correlated events

$$ \chi^2 =\sum\frac{(Observed -Expected)^2}{Expected} $$

for both $\chi ^ 2$ and *lift* measure, they might be influenced by null transactions

**Null invariance: **The number of null transactions does not matter, does not change the measure value.

## Imbalance Ratio (IR)

measure the imbalance of two itemsets A and B in rule implications

$$ IR(A,B)=\frac{|s(A)-s(B)|}{s(A)+s(B)-s(A\cup B)} $$

- if $IR=0$, means two events are balanced
- Lift and χ2 are good measures if null transactions are not predominant
- Otherwise, Kulczynski + Imbalance Ratio should be used to judge the interestingness of a pattern

## Mining Multilevel Patterns

**min-support:** there are two min-support thresholds

- uniform: same min-support across multiple levels
- level-reduced: min-support reduced as level become lower

**redundancy filtering:** redundant due to “ancestor” relationships

A rule is redundant if its support is close to the “expected” value, according to its “ancestor” rule, and it has a similar confidence as its “ancestor”

- milk $\rightarrow$ wheat bread [support = 8%, confidence = 70%] (1)
- 2% milk $\rightarrow$ wheat bread [support = 2%, confidence = 72%] (2)
- Suppose the 2% milk sold is about ¼ of milk sold in gallons, then, the data above is redundant

**customized min-supports: **to analyze the item with less frequency (low support), we can use group-based min-support

{diamond, watch}: 0.05%; {bread, milk}: 5%

**quantitative associations:**

methods used when mining associations with numerical attributes

- Static discretization: age: {0-10, 10-20, ..., 90-100} → {young, mid-aged, old}
- Dynamic discretization based on data distribution
- Clustering: Distance-based association
- Deviation analysis: Gender = female $\rightarrow$ Wage: mean=$7/hr

**mining extraordinary phenomena:**

check if LHS (a subset of the population) behave extraordinarily compared to RHS (an extraordinary behavior of this subset)

- The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence

**negative pattern: **Unlikely to happen together

If itemsets A and B are both frequent but rarely occur together, then:

$$ s(A\cup B) << s(A)\times s(B) $$

the equation above will be influenced by the null transaction, we can use Kulczynski-measure:

If itemsets A and B are frequent but: ($є$: negative pattern threshold)

$$ (\frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)})\cdot \frac{1}{2} < є $$

then A and B are negatively correlated.

## Constraint-Based Mining

Mine together with user-provided constraints

**type of constrains:**

- Knowledge type constraint—Specifying what kinds of knowledge to mine
- Data constraint—using SQL-like queries
- Dimension/level constraint—similar to projection in relational database
- Interestingness constraint—various kinds of thresholds
**Rule (or pattern) constraint**

**characteristic of constrain:**

- Anti-Monotonicity: set violated, then its superset must violate as well e.g. $support(S) \rightarrow \sigma$
- Monotone: if set satisfied, then its superset must satisfied as well. e.g. $sum(S.Price) \rightarrow v$
- Convertible: Convert tough constraints into (anti-)monotone by proper ordering of items in transactions. e.g. $avg(S.profit)>20$
- Data anti-monotone: In the mining process, if a data entry t cannot contribute to a pattern p satisfying c, t cannot contribute to p’s superset
- Succinctness: If the constraint c can be enforced by directly manipulating the data
- Data succinct: Data space can be pruned at the initial pattern mining process

## Download Note

本文作者：TwinIsland

版权声明：所有文章除特别声明外均系本人自主创作，转载及引用请注明出处

Tass412之光🥲

TwinIsland回复@Tass这课要给我变成光了💔