Types of data:

  • Record data: tables
  • Graphs and networks: relations
  • Ordered data: time sequences, DNA sequences, video streams
  • Spatial, image, and multimedia data: images, videos, maps


Important characteristics of structured data:

  • dimensionality
  • sparsity
  • resolution
  • distribution

data object: represents an entity in a dataset


attribute: a property of a data object; also called a dimension, feature, or variable. Types:
  • Nominal (e.g., red, blue)
  • Binary (e.g., {true, false})
  • Ordinal (e.g., {freshman, sophomore, junior, senior})
  • Numeric: quantitative
      • Interval: measured on a scale of equal-sized units, with no inherent zero-point
      • Ratio: has an inherent zero-point


discrete attribute: takes a finite or countably infinite set of values

continuous attribute: takes real numbers as values, so the set of possible values is uncountably infinite



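approximate median (for data grouped into intervals):

  • $n$: total number of samples
  • $L_1$: lower limit of the median interval
  • $width$: width of the median interval ($L_2 - L_1$)
  • $(\sum freq)_l$: sum of the frequencies of all intervals below the median interval
  • $freq_{median}$: frequency of the median interval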

$$ median = L_1 + (\frac{\frac n2-(\sum freq )_l}{freq_{median}})\cdot width $$
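The interpolation above can be sketched in a few lines of Python. This is only an illustration: the function name `approximate_median`, the `(lower, upper, freq)` tuple layout, and the sample frequencies are all assumptions made for the example, not part of the notes.

```python
def approximate_median(intervals):
    """Estimate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower, upper, freq) tuples sorted by
    lower bound (hypothetical layout chosen for this sketch).
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("need at least one observation")
    below = 0  # (sum freq)_l: frequencies below the current interval
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq


# Made-up grouped data: (lower limit, upper limit, frequency)
groups = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
print(approximate_median(groups))  # 22.5
```

Here $n = 50$, so the median interval is the one where the cumulative frequency first reaches 25, i.e. 20 to 30, and the estimate is $20 + \frac{25 - 20}{20}\cdot 10 = 22.5$.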


mode: the value that occurs most frequently in the data

Empirical formula:

works ONLY for unimodal, moderately skewed data

$$ mean - mode \approx 3\cdot(mean - median) $$
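Rearranging the standard empirical relation $mean - mode \approx 3\cdot(mean - median)$ gives $mode \approx 3\cdot median - 2\cdot mean$, which lets the mode be estimated when only the mean and median are known. A minimal sketch, where the helper name `estimate_mode` and the input numbers are made up for illustration:

```python
def estimate_mode(mean, median):
    """Solve mean - mode = 3 * (mean - median) for the mode.

    Only a rough estimate, and only meaningful for unimodal,
    moderately skewed data.
    """
    return 3 * median - 2 * mean


# Hypothetical summary statistics
print(estimate_mode(mean=58, median=54))  # 46
```

Check: $58 - 46 = 12$ and $3\cdot(58 - 54) = 12$.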


there are two versions of variance:

-> n: the size of the sample

$$ s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2 $$

-> N: the size of the population

$$ \sigma^2 = \frac{1}{N}\sum(x_i - \mu)^2 $$

standard deviation is the square root of the variance, denoted $s$ (sample) or $\sigma$ (population)
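The two divisors can be compared directly in Python. The sketch below writes both formulas out by hand; the function names and the data values are assumptions for the example. The standard library's `statistics.variance` and `statistics.pvariance` compute the same two quantities.

```python
import math


def sample_variance(xs):
    """s^2: divide by n - 1 (xs is a sample)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def population_variance(xs):
    """sigma^2: divide by N (xs is the whole population)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, mean = 5
print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(sample_variance(data))                 # 32/7, about 4.571
```

The sample version is always slightly larger because it divides the same sum of squared deviations by $n - 1$ instead of $N$.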


histogram analysis: Graph display of tabulated frequencies, shown as bars

Box plot: data is represented with a box whose ends are at the first and third quartiles, with the median marked by a line inside the box


difference between a bar chart and a histogram:

  • Histograms are used to show distributions of variables while bar charts are used to compare variables
  • Histograms plot binned quantitative data while bar charts plot categorical data
  • Bars can be reordered in bar charts but not in histograms
  • In a histogram the area of a bar denotes the value, not its height as in a bar chart; this distinction is crucial when the bins are not of uniform width
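The area-versus-height point can be made concrete with unequal bins. In the sketch below, the function name `histogram_bars`, the bin edges, and the values are all hypothetical; each bar's height is its count divided by its bin width, so that height times width recovers the count.

```python
from bisect import bisect_right


def histogram_bars(values, edges):
    """Return (counts, heights) for bins defined by sorted `edges`.

    Bins are half-open [lo, hi), except the last one, which also
    includes its upper edge. Values outside the edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            i = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    # Height = count / bin width, so bar area equals the count.
    heights = [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, heights


values = [1, 3, 5, 7, 9, 12, 15, 18, 25, 30, 35, 40, 45, 48]
edges = [0, 10, 20, 50]  # the last bin is three times wider
counts, heights = histogram_bars(values, edges)
print(counts)   # [5, 3, 6]
print(heights)  # [0.5, 0.3, 0.2]
```

The last bin holds the most values but gets the shortest bar, because its six values are spread over a width of 30. Plotting raw counts as heights here would overstate how dense the data is in that range.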


