data mining functions:

  1. Generalization
  2. Pattern discovery
  3. Classification
  4. Cluster analysis
  5. Outlier analysis
  6. Time and ordering: sequential pattern, trend, and evolution analysis
  7. Structure and network analysis: graph mining


characteristics of structured data:

  1. Dimensionality

    1. Curse of dimensionality
  2. Sparsity

    1. Only presence counts
  3. Resolution

    1. Patterns depend on the scale
  4. Distribution

    1. Centrality and dispersion
  • Data sets are made up of data objects
  • Data objects are described by attributes, also called dimensions, features, or variables


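  1. Nominal: categories with no inherent order, e.g. hair color (auburn, black, blond, brown, grey, red, white)
  2. Binary: a nominal attribute with only two states, 0/1
  3. Ordinal: values have a meaningful order but unknown spacing, e.g. small, medium, large
  4. Interval: measured on a scale of equal-sized units with no true zero point, e.g. temperature in °C or °F
  5. Ratio: has an inherent zero point, e.g. temperature in kelvin, counts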

type 2:

  1. Discrete attribute: has a finite or countably infinite set of values, e.g. zip code, profession
  2. Continuous attribute: has real numbers as attribute values, e.g. height, weight

statistical measurement:

mean: $\bar x=\frac1n\sum^n_{i=1}x_i$ or $\mu=\frac1N\sum x$

weighted mean: $\bar x=\frac{\sum^n_{i=1}w_ix_i}{\sum^n_{i=1}w_i}$

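median (approx, for data grouped into intervals): $L_1 + \left(\frac{n/2-(\sum \text{freq})_l}{\text{freq}_{median}}\right)\times \text{width}$

$L_1$: lower boundary of the median interval

$n$: number of values in the data set

$(\sum \text{freq})_l$: sum of the frequencies of all intervals below the median interval

$\text{freq}_{median}$: frequency of the median interval

$\text{width}$: width of the median interval, $L_2 - L_1$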

mode: Value that occurs most frequently in the data
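
The measures above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names and the `(low, high)` interval representation are assumptions made for the example, and `grouped_median` follows the approximate formula for data grouped into intervals.

```python
from collections import Counter


def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Weighted mean: each value x_i contributes in proportion to w_i."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def grouped_median(intervals, freqs):
    """Approximate median for grouped data.

    intervals: list of (low, high) tuples in ascending order (hypothetical layout)
    freqs:     number of values falling in each interval
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    below = 0  # sum of frequencies of the intervals before the current one
    for (low, high), f in zip(intervals, freqs):
        if below + f >= n / 2:
            # This is the median interval: interpolate linearly inside it.
            return low + (n / 2 - below) / f * (high - low)
        below += f


def modes(xs):
    """All values that occur most frequently (data can have several modes)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]


if __name__ == "__main__":
    print(mean([1, 2, 3, 4]))
    print(weighted_mean([1, 2, 3], [1, 1, 2]))
    print(grouped_median([(0, 10), (10, 20), (20, 30)], [2, 5, 3]))
    print(modes([1, 2, 2, 3, 3]))
```

`modes` returns a list because data may be bimodal or trimodal, in which case a single "most frequent value" does not exist.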

data matrix:

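  • A data matrix of n data points with l dimensions has shape $n\times l$
  • Dissimilarity (distance) matrix: an $n\times n$ matrix whose entry $d(i,j)$ is the distance between objects $i$ and $j$; it is symmetric with a zero diagonal, so only one triangle needs to be stored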

[Figure: data matrix]
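
As a rough sketch of how a data matrix turns into a dissimilarity matrix, the following builds the lower triangle row by row. The use of Euclidean distance and the name `dissimilarity_matrix` are assumptions for illustration; any other distance measure could be substituted.

```python
import math


def dissimilarity_matrix(data):
    """Lower-triangular dissimilarity matrix for a list of points.

    Row i holds d(i, 0), ..., d(i, i); the last entry of each row is the
    zero diagonal. Requires Python 3.8+ for math.dist.
    """
    return [
        [math.dist(data[i], data[j]) for j in range(i + 1)]
        for i in range(len(data))
    ]


if __name__ == "__main__":
    # Four 2-dimensional points, so the data matrix has shape 4 x 2.
    points = [(1, 2), (3, 5), (2, 0), (4, 5)]
    for row in dissimilarity_matrix(points):
        print([round(d, 2) for d in row])
```

Storing only one triangle halves the memory, which is safe because $d(i,j)=d(j,i)$.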


  • z-score: $z=\frac{x-\mu}{\sigma}$; the mean absolute deviation can be used in place of $\sigma$, which makes the score more robust to outliers
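
A minimal sketch of both standardization variants, assuming the population standard deviation for $\sigma$. The function names are illustrative only.

```python
import math


def z_scores(xs):
    """Standardize with the mean and the population standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]


def z_scores_mad(xs):
    """Standardize with the mean absolute deviation instead of sigma.

    Deviations are not squared, so a single extreme value inflates the
    denominator less than it inflates the standard deviation.
    """
    mu = sum(xs) / len(xs)
    s = sum(abs(x - mu) for x in xs) / len(xs)
    return [(x - mu) / s for x in xs]


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(z_scores(data))
    print(z_scores_mad(data))
```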



  1. Minkowski distance (L-p norm):

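$$ d(i,j)=\left(\sum_{f=1}^{l}|x_{if}-x_{jf}|^p\right)^{\frac1p} $$

where $i$ and $j$ are two $l$-dimensional data objects; $p=1$ gives the Manhattan distance, $p=2$ the Euclidean distance, and $p\to\infty$ the supremum distance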


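Minkowski distance properties:

  • Positivity: $d(i,j)>0$ if $i\neq j$, and $d(i,i)=0$
  • Symmetry: $d(i,j)=d(j,i)$
  • Triangle inequality: $d(i,j)\leq d(i,k)+d(k,j)$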
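
The Minkowski distance for a chosen order $p$ can be sketched as below. The function name and the use of `math.inf` to request the supremum distance are assumptions made for this example.

```python
import math


def minkowski(x, y, p):
    """L-p distance between two equal-length numeric vectors.

    p = 1 is the Manhattan distance, p = 2 the Euclidean distance, and
    p = math.inf the supremum (maximum coordinate difference) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)


if __name__ == "__main__":
    x1, x2 = (1, 2), (3, 5)
    print(minkowski(x1, x2, 1))
    print(minkowski(x1, x2, 2))
    print(minkowski(x1, x2, math.inf))
```

For a fixed pair of points the distance does not increase as $p$ grows, so the Manhattan distance is the largest of the three and the supremum distance the smallest.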




  • Empirical formula (for moderately skewed unimodal data): $\text{mean}-\text{mode} \approx 3\times (\text{mean}-\text{median})$


multimodal:

  • includes bimodal, trimodal, etc., depending on the number of peaks

[Figure: multimodal distribution]


  1. Symmetric: mean, median, and mode coincide
  2. Skewed: either positively skewed (long right tail, mode < median < mean) or negatively skewed (long left tail, mean < median < mode)


normal distribution curve



Variance ($s^2 \text{ or } \sigma^2$) and standard deviation ($s \text{ or } \sigma$) measure the dispersion of the data

$$ s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar x)^2\\ \sigma^2=\frac1N\sum^N_{i=1}(x_i-\mu)^2 $$

n: sample size, N: population size
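
The only difference between the two formulas is the divisor, which the sketch below makes explicit. The function names are illustrative; the standard library's `statistics.variance` and `statistics.pvariance` compute the same two quantities.

```python
import math


def sample_variance(xs):
    """s^2: divide by n - 1 because the data are a sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """sigma^2: divide by N because the data are the whole population."""
    big_n = len(xs)
    mu = sum(xs) / big_n
    return sum((x - mu) ** 2 for x in xs) / big_n


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(sample_variance(data), math.sqrt(sample_variance(data)))
    print(population_variance(data), math.sqrt(population_variance(data)))
```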


  1. Boxplot: graphic display of the five number summary
  2. Histogram: the x axis shows values, the y axis shows frequencies
  3. Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\leq x_i$
  4. Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  5. Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane

box plot:

Quartiles: Q1 (25th percentile), Q3 (75th percentile)

IQR: Q3 - Q1

Five number summary: min, Q1, median, Q3, max
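
A small sketch of the five number summary and the IQR using the standard library. Quartile conventions differ between textbooks and tools; the `"inclusive"` method of `statistics.quantiles` is just one choice, picked here as an assumption, and the function names are illustrative.

```python
import statistics


def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    xs = sorted(xs)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, median, q3, xs[-1]


def iqr(xs):
    """Interquartile range: the spread of the middle half of the data."""
    _, q1, _, q3, _ = five_number_summary(xs)
    return q3 - q1


if __name__ == "__main__":
    data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
    print(five_number_summary(data))
    print(iqr(data))
```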

[Figure: box plot]


histogram: a graphical display of tabulated frequencies, shown as bars

  • Difference between histograms and bar charts: histograms show the distribution of a variable, while bar charts compare variables
  • Histograms often tell more than boxplots: different histograms may have the same boxplot representation


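cosine similarity:

$$ \cos(d_1,d_2)=\frac{d_1\cdot d_2}{\|d_1\|\,\|d_2\|} $$

where $d_1\cdot d_2$ is the dot product of the two vectors and $\|d\|$ is the Euclidean length of $d$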
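
Cosine similarity is the dot product of two vectors divided by the product of their Euclidean lengths. A minimal sketch follows; the function name and the two term-frequency vectors are made up for the example.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    # Hypothetical term-frequency vectors of two documents.
    doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    doc2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
    print(round(cosine_similarity(doc1, doc2), 2))
```

Because only the angle between the vectors matters, scaling a vector (for example a longer document with the same word proportions) leaves the similarity unchanged.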

chi-square test:

  • The larger the $\chi^2$ value, the more likely the variables are related
  • Null hypothesis: the two variables are independent
  • Correlation does not imply causality

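$$ \chi^2=\sum_{i}\frac{(o_i-e_i)^2}{e_i} $$

where $o_i$ is the observed count and $e_i$ the expected count of cell $i$ under the null hypothesis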
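
The $\chi^2$ statistic sums $(\text{observed}-\text{expected})^2/\text{expected}$ over all cells of a contingency table, where the expected count of a cell under independence is (row total × column total) / grand total. A sketch under those definitions; the function name and the 2×2 table are illustrative assumptions.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the row and column variables were independent.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2


if __name__ == "__main__":
    # Hypothetical survey: rows = likes science fiction (yes / no),
    # columns = plays chess (yes / no).
    table = [[250, 200],
             [50, 1000]]
    print(round(chi_square(table), 2))
```

The statistic alone does not give a decision; it has to be compared against the critical value for the table's degrees of freedom, (rows − 1) × (columns − 1).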


variance for a single variable: $\sigma^2=E[(X-\mu)^2]$

covariance for two variables: $\sigma_{12}=E[(X_1-\mu_1)(X_2-\mu_2)]=E[X_1X_2]-\mu_1\mu_2=E[X_1X_2]-E[X_1]E[X_2]$

  • the sign of the covariance indicates the direction of the relationship
  • if $X_1$ and $X_2$ are independent, then $\sigma_{12}=0$, but the converse is not true

[Figure: variance equation]


correlation coefficient: $\rho_{12}>0$ means positively correlated, $\rho_{12}=0$ uncorrelated, and $\rho_{12}<0$ negatively correlated

$$ \rho_{12}=\frac{\sigma_{12}}{\sqrt{\sigma_1^2\sigma_2^2}} $$
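
Covariance and the correlation coefficient can be computed directly from the expectation form $E[X_1X_2]-E[X_1]E[X_2]$. The sketch below treats the data as a full population (divisor $n$); the function names and the two price series are assumptions for illustration.

```python
import math


def covariance(xs, ys):
    """Population covariance via E[XY] - E[X]E[Y]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y


def correlation(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    # covariance(xs, xs) is the variance of xs.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


if __name__ == "__main__":
    # Hypothetical prices of two stocks over five days.
    a = [2, 3, 5, 4, 6]
    b = [5, 8, 10, 11, 14]
    print(covariance(a, b))
    print(correlation(a, b))
```

The covariance depends on the units of both variables, while the correlation coefficient always lies in $[-1, 1]$, which makes it easier to compare across pairs of variables.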

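Kullback-Leibler (KL) divergence:

Measures the difference between two probability distributions over the same variable x. The first form is for discrete distributions, the second for continuous ones. It is not symmetric: in general $D_{KL}(p\|q)\neq D_{KL}(q\|p)$

$$ D_{KL}(p(x)\|q(x))=\sum_{x\in X}p(x)\ln\frac{p(x)}{q(x)}\\ D_{KL}(p(x)\|q(x))=\int_{-\infty}^{\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx $$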

[Figure: KL divergence graph]

  • when $p(x) \not=0$ but $q(x)=0$, $D_{KL}$ is defined as $\infty$, because $p$ says the event is possible while $q$ says it is impossible
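
A minimal sketch of the discrete form, with both distributions given as lists of probabilities over the same outcomes. The function name is illustrative, and treating terms with $p(x)=0$ as contributing 0 is the usual convention, stated here as an assumption.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions over the same outcomes."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # convention: 0 * ln(0 / q) contributes nothing
        if q_x == 0:
            return math.inf  # p allows an outcome that q rules out
        total += p_x * math.log(p_x / q_x)
    return total


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl_divergence(p, q))
    print(kl_divergence(q, p))  # differs from the line above
    print(kl_divergence(p, [1.0, 0.0]))
```

Swapping the arguments changes the result, which is why KL divergence is a divergence and not a distance metric.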

Data cleaning

types of dirty data:

  1. Incomplete: Salary = ""
  2. Noisy: Salary = "-10" (an error)
  3. Inconsistent: Age = "42", Birthday = "03/07/2022"
  4. Intentional (disguised missing data): Jan. 1 as everyone's birthday

data integration:

Combining data from multiple sources into a coherent store

Shell Fragment Cubes

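A method for handling high-dimensional data cubes: the dimensions are partitioned into small disjoint fragments, and a full cube is precomputed for each fragment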

space requirement:

Given a database of T tuples, D dimensions, and F shell fragment size, the fragment cubes’ space requirement is:

$$ O\left(T\left\lceil\frac{D}{F}\right\rceil(2^F-1)\right) $$
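
To get a feel for how the space requirement scales, the sketch below evaluates the bound $T\lceil D/F\rceil(2^F-1)$ that is commonly cited for shell fragment cubes. Both the bound as used here and the function name are assumptions for illustration, and the result is an order-of-magnitude count of stored entries, not an exact byte size.

```python
import math


def fragment_cube_space(num_tuples, num_dims, fragment_size):
    """Evaluate T * ceil(D / F) * (2^F - 1).

    ceil(D / F) is the number of fragments, and 2^F - 1 is the number of
    non-empty cuboids computed inside one fragment of F dimensions.
    """
    fragments = math.ceil(num_dims / fragment_size)
    cuboids_per_fragment = 2 ** fragment_size - 1
    return num_tuples * fragments * cuboids_per_fragment


if __name__ == "__main__":
    # Hypothetical database: one million tuples, 60 dimensions.
    for f in (1, 2, 3, 4):
        print(f, fragment_cube_space(10**6, 60, f))
```

The expression is linear in both T and D but exponential in F, which is why the fragment size is kept small.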


[Figure: fragment cube query]
