## Dataset

Types of data:

• Record data: tables
• Graphs and networks: relations
• Ordered data: time series, DNA sequences, video streams
• Spatial, image, and multimedia data: images, videos, maps

Characteristics:

• dimensionality
• sparsity
• resolution
• distribution

Data object: represents an entity in the dataset

### Attribute

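Also called a dimension, feature, or variable.

• Nominal: categories with no meaningful order (e.g., red, blue)
• Binary: nominal with only two states (e.g., {true, false})
• Ordinal: values have a meaningful order (e.g., {freshman, sophomore, junior, senior})
• Numeric: quantitative
  • Interval: measured on a scale of equal-sized units, with no inherent zero point
  • Ratio: has an inherent zero point, so ratios between values are meaningful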

#### Types

Discrete attribute: takes a finite or countably infinite set of values

Continuous attribute: takes real-number values, so between any two values there are infinitely many others

## Measuring

### Median

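Approximate median (for grouped data, interpolated within the interval that contains the median):

| Symbol | Note |
| --- | --- |
| $n$ | total number of samples |
| $L_1$ | lower limit of the median interval |
| $width$ | width of the median interval ($L_2 - L_1$) |
| $(\sum freq)_l$ | sum of the frequencies of all intervals before the median interval |
| $freq_{median}$ | frequency of the median interval |

$$median = L_1 + \left(\frac{\frac{n}{2} - (\sum freq)_l}{freq_{median}}\right)\cdot width$$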
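The interpolation above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `approximate_median` and the `(lower, upper, freq)` tuple layout are assumptions made for illustration, and the intervals are assumed to be sorted and non-overlapping.

```python
def approximate_median(intervals):
    """Estimate the median of grouped data.

    intervals: list of (lower, upper, freq) tuples sorted by lower bound
    (a hypothetical layout chosen for this sketch).
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals must contain at least one observation")

    cumulative = 0  # (sum freq)_l: frequencies before the current interval
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq


# Example: 50 observations grouped into four intervals of width 10.
grouped = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
estimate = approximate_median(grouped)
print(estimate)
```

Here $n/2 = 25$ falls in the interval 20 to 30, with 20 observations before it, so the estimate is $20 + \frac{25 - 20}{20}\cdot 10$.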

### Mode

The value that occurs most frequently in the data

Empirical formula:

Holds only approximately, and only for unimodal, moderately skewed data

$$mean - mode \approx 3\cdot(mean - median)$$
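Rearranging the usual form of the empirical relation, $mean - mode \approx 3\cdot(mean - median)$, gives $mode \approx 3\cdot median - 2\cdot mean$. The sketch below applies that rearrangement; the function name `estimate_mode` is an assumption for illustration, and the result is only a rough estimate for unimodal, moderately skewed data.

```python
def estimate_mode(mean, median):
    """Rough mode estimate from the empirical relation
    mean - mode ~= 3 * (mean - median).

    Only meaningful for unimodal, moderately skewed data.
    """
    return 3 * median - 2 * mean


# Example with a right-skewed summary: the mean is pulled above the median.
mode_estimate = estimate_mode(mean=5.0, median=4.0)
print(mode_estimate)
```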

### Variance&Std

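There are two versions of variance:

Sample variance, where $n$ is the size of the sample and $\bar{x}$ is the sample mean:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Population variance, where $N$ is the size of the population and $\mu$ is the population mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

The standard deviation is the square root of the variance, denoted $s$ (sample) or $\sigma$ (population).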
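To make the difference between the two divisors concrete, here is a minimal sketch that computes both variances by hand. The function names are assumptions for illustration; Python's standard `statistics` module offers `variance`/`stdev` (sample) and `pvariance`/`pstdev` (population) for real use.

```python
import math


def sample_variance(xs):
    """Variance with divisor n - 1 (data is a sample)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """Variance with divisor N (data is the whole population)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(data)
sigma2 = population_variance(data)
s = math.sqrt(s2)          # sample standard deviation
sigma = math.sqrt(sigma2)  # population standard deviation
print(s2, sigma2, s, sigma)
```

For the same data the sample variance is always the larger of the two, because it divides the same sum of squared deviations by $n - 1$ instead of $N$.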

## Plot

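Histogram: a graphical display of tabulated frequencies, shown as bars

Box plot: the data is summarized with a box that spans the first to the third quartile, with a line at the median and whiskers extending toward the minimum and maximum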
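A box plot is usually drawn from the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The sketch below computes that summary with the standard library. The function name `five_number_summary` is an assumption for illustration, and quartile conventions vary, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Expects at least two numeric values.
    """
    ordered = sorted(data)
    # method="inclusive" treats the data as the whole population; this is
    # one of several quartile conventions, chosen here for the sketch.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
minimum, q1, median, q3, maximum = summary
iqr = q3 - q1  # interquartile range: the length of the box
print(summary, iqr)
```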

Differences between a bar chart and a histogram:

• Histograms are used to show distributions of variables while bar charts are used to compare variables
• Histograms plot binned quantitative data while bar charts plot categorical data
• Bars can be reordered in bar charts but not in histograms
• In a histogram, the area of each bar denotes the value, not its height as in a bar chart; this distinction is crucial when the bins are not of uniform width
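The last point can be illustrated with unequal bins: if each bar's height is the count divided by the bin width, the bar's area equals the count. This is a minimal sketch under stated assumptions: the function name `histogram_bars` is hypothetical, bins are taken as closed on the left and open on the right (except the last, which includes its right edge), and values outside the edges are ignored.

```python
from bisect import bisect_right


def histogram_bars(values, edges):
    """Return one (low, high, count, height) tuple per bin.

    height = count / width, so that height * width (the area) is the count.
    edges must be strictly increasing.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # outside all bins
        # Index of the bin containing v; clamp so the top edge falls in the last bin.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return [
        (low, high, count, count / (high - low))
        for low, high, count in zip(edges, edges[1:], counts)
    ]


# Two bins of width 5 and one of width 10.
bars = histogram_bars([1, 2, 2, 3, 7, 8, 15], edges=[0, 5, 10, 20])
for low, high, count, height in bars:
    print(f"[{low}, {high}): count={count}, height={height}")
```

Plotting the raw counts as heights would make the wide last bin look as important as a narrow bin with the same count; dividing by the width corrects for that.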