Describing data

mean: $mean(\{x\})=\frac1N\sum^{N}_{i=1}x_i$

  • scaling: $mean(\{kx\})=k\,mean(\{x\})$
  • translating: $mean(\{x+c\})=mean(\{x\})+c$
  • $\sum^N_{i=1}(x_i-mean(\{x\}))=0$
  • the mean minimizes the sum of squared distances to the data points: $\sum_i(x_i-\mu)^2$ is smallest when $\mu=mean(\{x\})$
  • strongly affected by outliers
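The properties above can be checked numerically. This is a minimal sketch on a made-up sample; the names `data`, `k`, `c` and `sum_sq` are illustrative only.

```python
from statistics import fmean

data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up sample
m = fmean(data)

k, c = 3.0, 7.0
scaled_mean = fmean([k * x for x in data])   # should equal k * m
shifted_mean = fmean([x + c for x in data])  # should equal m + c
deviation_sum = sum(x - m for x in data)     # should be 0 (up to rounding)


def sum_sq(center):
    """Sum of squared distances from every data item to `center`."""
    return sum((x - center) ** 2 for x in data)


# A single extreme item drags the mean far from the bulk of the data.
mean_with_outlier = fmean(data + [1000.0])

print(m, scaled_mean, shifted_mean, deviation_sum)
print(sum_sq(m), sum_sq(m - 0.5), sum_sq(m + 0.5))
print(mean_with_outlier)
```

Moving the center away from the mean in either direction increases `sum_sq`, which is the minimization property in action.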

standard deviation: $std(\{x\})=\sqrt{\frac1N\sum^{N}_{i=1}(x_i-mean(\{x\}))^2}$

  • when the std is small, most data items lie close to the mean
  • translating: $std(\{x+c\})=std(\{x\})$
  • scaling: $std(\{kx\})=|k|\,std(\{x\})$
  • at most a fraction $\frac1{k^2}$ of the data points lie $k$ or more standard deviations away from the mean
  • there must be at least one data item that is at least one standard deviation away from the mean
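A small numerical check of these properties, as a sketch on a made-up sample. `pstdev` is used because it divides by $N$, matching the definition above; `fraction_far` is an illustrative helper name.

```python
from statistics import fmean, pstdev

data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0]  # made-up sample
m = fmean(data)
s = pstdev(data)  # population standard deviation (divides by N)


def fraction_far(k):
    """Fraction of data items lying k or more standard deviations from the mean."""
    return sum(abs(x - m) >= k * s for x in data) / len(data)


shifted_std = pstdev([x + 10.0 for x in data])  # translation leaves std unchanged
scaled_std = pstdev([-2.0 * x for x in data])   # scaling by k multiplies std by |k|
farthest = max(abs(x - m) for x in data)        # at least one item is >= 1 std away

for k in (1, 2, 3):
    print(k, fraction_far(k), 1 / k ** 2)
```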

variance: $var(\{x\})=\frac1N\sum^{N}_{i=1}(x_i-mean(\{x\}))^2$

  • translating: $var(\{x+c\})=var(\{x\})$
  • $var(\{k\})=0$ for a dataset in which every item equals the constant $k$
  • $var(\{kx\})=k^2var(\{x\})$

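median: an alternative to the mean as a summary of where the data lie, less affected by outliers

  • scaling: $median(\{kx\})=k\,median(\{x\})$
  • translating: $median(\{x+c\})=median(\{x\})+c$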

interquartile range:


The interquartile range of a dataset $\{x\}$ is $iqr(\{x\})=percentile(\{x\},75)-percentile(\{x\},25 )$

  • estimates how spread out the data is, and is barely affected by outliers
  • scaling: $iqr(\{kx\})=|k|\,iqr(\{x\})$
  • translating: $iqr(\{x+c\})=iqr(\{x\})$
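To see the robustness of the median and the interquartile range, here is a sketch comparing a made-up sample with a copy in which one item is replaced by an outlier. Percentile conventions differ between textbooks and libraries; the `method="inclusive"` choice below is an assumption, not something these notes specify.

```python
from statistics import fmean, median, pstdev, quantiles


def iqr(data):
    """75th percentile minus 25th percentile (inclusive interpolation)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dirty = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # same data with one outlier

for name, data in (("clean", clean), ("dirty", dirty)):
    print(name, fmean(data), pstdev(data), median(data), iqr(data))
```

The mean and standard deviation change a great deal between the two samples, while the median and the interquartile range do not move.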



  • bar chart vs histogram: a bar chart is for categorical data, while a histogram is for quantitative data
  • uni/multimodal: a unimodal histogram has one peak, a bimodal one has two, a multimodal one has many
  • skew: a histogram can be symmetric, left skewed, or right skewed; left skewed means its long tail is on the left

[figure: left skewed]

box plot:

A box plot is a way to plot data that simplifies comparison

[figure: box plot]

outlier: a data item that is larger than $q_3+1.5(q_3-q_1)$ or smaller than $q_1-1.5(q_3-q_1)$, where $q_1$ and $q_3$ are the 25th and 75th percentiles

whisker: shows the range of the non-outlier data
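The outlier rule can be sketched in a few lines. The helper name `box_plot_summary` and the sample are made up, and the percentile convention (`method="inclusive"`) is an assumption; other conventions give slightly different quartiles.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    spread = q3 - q1
    low_fence = q1 - 1.5 * spread
    high_fence = q3 + 1.5 * spread
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # range of non-outlier data
        "outliers": outliers,
    }


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 40])
print(summary)
```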

Standard coordinates


the data in normalized form: each item is shifted by the mean and divided by the standard deviation

$$ \hat {x_i}=\frac{x_i-mean(\{x\})}{std(\{x\})} $$

  • the mean of data in standard coordinates is 0
  • the standard deviation of data in standard coordinates is 1
  • for many kinds of data, the histograms of the standard coordinates look the same; their common shape is the standard normal curve, given by:

$$ y(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2} $$

  • data whose histogram in standard coordinates looks like the standard normal curve is called normal data
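A minimal sketch of the transformation, checking that the result has mean 0 and standard deviation 1. The function name `standardize` and the sample are illustrative.

```python
from statistics import fmean, pstdev


def standardize(data):
    """Map each item to (x - mean) / std, using the divide-by-N std."""
    m, s = fmean(data), pstdev(data)
    return [(x - m) / s for x in data]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
z = standardize(data)

print(z)
print(fmean(z), pstdev(z))
```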


correlation coefficient:

$$ corr(\{(x,y)\})=\frac{\sum_i\hat{x_i}\hat{y_i}}{N} $$

  • ranges from -1 to 1; the larger the absolute value, the better $x$ predicts $y$
  • the sign indicates positive or negative correlation
  • 0 means no correlation; 1 means $\hat {x_i}=\hat {y_i}$ for every $i$
  • $corr(\{(x,y)\})=corr(\{(y,x)\})$
  • translating the data does not change the value of the correlation coefficient
  • scaling the data can change the sign, but not the absolute value
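The formula and the properties listed above can be checked directly. This is a sketch on made-up paired data; `standardize` and `corr` are illustrative helper names.

```python
from statistics import fmean, pstdev


def standardize(data):
    m, s = fmean(data), pstdev(data)
    return [(x - m) / s for x in data]


def corr(xs, ys):
    """Mean of the products of the standard coordinates."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up paired data
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

r = corr(xs, ys)
r_swapped = corr(ys, xs)                      # symmetry
r_shifted = corr([x + 10.0 for x in xs], ys)  # translation: unchanged
r_flipped = corr([-3.0 * x for x in xs], ys)  # negative scale: sign flips
r_self = corr(xs, xs)                         # identical data: 1

print(r, r_swapped, r_shifted, r_flipped, r_self)
```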


prediction using correlation:

  1. Transform the data set into standard coordinates
  2. Compute the correlation $r$
  3. Predict $\hat {y_0}=r\hat{x_0}$
  4. Transform the prediction back into the original coordinates
  • Rule of thumb: the predicted value of $y$ goes up by $r$ standard deviations when the value of $x$ goes up by one standard deviation.
  • root mean square error of the prediction, in standard coordinates: $\sqrt{1-r^2}$
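The four steps above can be sketched as follows. The function name `fit` and the data are made up; the last lines compare the root mean square error measured in standard coordinates with $\sqrt{1-r^2}$.

```python
from math import sqrt
from statistics import fmean, pstdev


def fit(xs, ys):
    """Return the correlation r and a predictor for y given x."""
    mx, sx = fmean(xs), pstdev(xs)
    my, sy = fmean(ys), pstdev(ys)
    # Steps 1 and 2: standard coordinates, then the correlation.
    r = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

    def predict(x0):
        x0_hat = (x0 - mx) / sx   # x0 in standard coordinates
        y0_hat = r * x0_hat       # step 3: predict in standard coordinates
        return my + sy * y0_hat   # step 4: back to original coordinates

    return r, predict


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up paired data
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r, predict = fit(xs, ys)

# RMS error of the predictions, measured in units of std(y).
sy = pstdev(ys)
rms = sqrt(fmean([((y - predict(x)) / sy) ** 2 for x, y in zip(xs, ys)]))

print(r, predict(6.0), rms, sqrt(1 - r ** 2))
```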



outcome: a possible result of the experiment; every run of the experiment produces exactly one of the set of possible outcomes

sample space: the set of all outcomes, which we usually write $\Omega$

event: a set of outcomes

  • $P(\Omega)=1$
  • $P(\emptyset)=0$
  • if the $A_i$ are disjoint events, that is, $A_i\cap A_j=\emptyset$ whenever $i\not = j$, then:

$$ P(\cup_iA_i)=\sum_iP(A_i) $$



combinations: the number of ways to select $k$ items from $N$, regardless of order

$$ \binom{N}{k}=\frac{N!}{k!(N-k)!} $$
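As a quick check, the formula can be compared with Python's built-in `math.comb` and with an explicit listing of the selections. The values $N=5$, $k=2$ are an arbitrary example.

```python
from itertools import combinations
from math import comb, factorial

N, k = 5, 2
by_formula = factorial(N) // (factorial(k) * factorial(N - k))
by_builtin = comb(N, k)
by_listing = len(list(combinations(range(N), k)))  # enumerate every unordered pick

print(by_formula, by_builtin, by_listing)
```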

calculating probabilities:

$$ \begin{aligned} P(A)+P(A^c)&=1\\ P(A-B)&=P(A)-P(A\cap B)\\ P(A\cup B)&=P(A)+P(B)-P(A\cap B) \end{aligned} $$
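These rules can be verified by enumerating a small sample space. The sketch below assumes two fair dice with equally likely outcomes; the events `A` (the sum is even) and `B` (the first die shows 6) are made up for illustration.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # all 36 outcomes of two dice


def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))


A = {o for o in omega if sum(o) % 2 == 0}  # the sum is even
B = {o for o in omega if o[0] == 6}        # the first die shows 6

print(P(A), P(omega - A))
print(P(A - B), P(A) - P(A & B))
print(P(A | B), P(A) + P(B) - P(A & B))
```

Here `A | B` is Python's set union, not a conditional probability.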


[figure: combination example]

Conditional probability


the probability that $B$ occurs given that $A$ has definitely occurred, written $P(B|A)$ and defined when $P(A)>0$:

$$ P(B|A)=\frac{P(B\cap A)}{P(A)}=\frac{P(A|B)P(B)}{P(A)} $$

  • law of total probability: $P(A)=P(A|B)P(B)+P(A|B^c)P(B^c)$
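A worked example of the two formulas together, with invented numbers: suppose $B$ is a rare condition with $P(B)=\frac{1}{100}$ and $A$ is a positive test result with $P(A|B)=\frac{19}{20}$ and $P(A|B^c)=\frac{1}{20}$. None of these values come from the notes; they are assumptions for illustration.

```python
from fractions import Fraction

p_b = Fraction(1, 100)             # P(B), assumed
p_a_given_b = Fraction(19, 20)     # P(A|B), assumed
p_a_given_not_b = Fraction(1, 20)  # P(A|B^c), assumed

# P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# P(B|A) = P(A|B)P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a

print(p_a, p_b_given_a, float(p_b_given_a))
```

Even though the test is right 95% of the time in this made-up setting, $P(B|A)$ stays small because $B$ is rare.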


Two events A and B are independent if and only if $P(A\cap B)=P(A)P(B)$

Equivalently, when $P(A)>0$ and $P(B)>0$, two independent events satisfy $P(A|B)=P(A)$ and $P(B|A)=P(B)$; put simply:

$$ P(A\cap B)=P(A)P(B) $$

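  • pairwise independence: each pair of events in the list is independent. Pairwise independence does not imply that the events are independent as a whole.
  • conditional independence: $A_1,\ldots,A_n$ are conditionally independent given $B$ if $P(A_1\cap \ldots\cap A_n|B)=P(A_1|B)\ldots P(A_n|B)$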
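A standard counterexample shows why pairwise independence is weaker. Assume two fair coin flips, with `A` = first flip is heads, `B` = second flip is heads, and `C` = both flips agree; these event names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

omega = set(product("HT", repeat=2))  # HH, HT, TH, TT, equally likely


def P(event):
    return Fraction(len(event), len(omega))


A = {o for o in omega if o[0] == "H"}   # first flip is heads
B = {o for o in omega if o[1] == "H"}   # second flip is heads
C = {o for o in omega if o[0] == o[1]}  # both flips agree

pairwise = all(P(x & y) == P(x) * P(y) for x, y in [(A, B), (A, C), (B, C)])
mutual = P(A & B & C) == P(A) * P(B) * P(C)

print(pairwise, mutual, P(A & B & C), P(A) * P(B) * P(C))
```

Every pair passes the product test, but the triple intersection has probability $\frac14$ rather than $\frac18$.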

Random variables


Given a sample space $\Omega$, a set of events $F$, a probability function $P$, and a countable set of real numbers $D$, a discrete random variable is a function with domain $\Omega$ and range $D$.

probability distribution function: $P(\{X=x\})$

cumulative distribution function: $P(\{X\leq x\})$

joint probability function: $P(\{X=x\}\cap\{Y=y\})=P(x,y)$

Bayes' Rule:

$$ P(x|y)=\frac{P(y|x)P(x)}{P(y)} $$

independent random variables: $P(x,y)=P(x)P(y)$

probability density function


Let $p(x)$ be a probability density function (often called a pdf or density) for a continuous random variable $X$. We interpret this function by thinking in terms of small intervals. Assume that $dx$ is an infinitesimally small interval. Then: $p(x)dx=P(\{X\in[x,x+dx]\})$

  • non-negative: $p(x)\geq0$ for all $x$
  • $\int^{\infty}_{-\infty}p(x)dx=1$

normalizing constant: if $g(x)$ is non-negative and proportional to a density, multiplying it by $\frac{1}{\int^{\infty}_{-\infty}g(x)dx}$ turns it into a density
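As a worked example (the choice of $g$ is an assumption made here for illustration), take $g(x)=e^{-x^2/2}$. Its integral over the real line is $\sqrt{2\pi}$, so the normalizing constant is $\frac{1}{\sqrt{2\pi}}$, which recovers the standard normal curve:

$$ \int^{\infty}_{-\infty}e^{-x^2/2}dx=\sqrt{2\pi}\quad\Rightarrow\quad p(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2} $$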

Expected Values


Given a discrete random variable $X$ which takes values in the set $D$ and which has probability distribution $P$, we define the expected value:

$$ \mathbb{E}[X]=\sum_{x\in D}xP(X=x)=\mathbb{E}_p[X] $$

For a continuous random variable $X$ which takes values in the set $D$ and which has probability density function $p$, we define the expected value as:

$$ \mathbb{E}[X]=\int_{x\in D}xp(x)dx=\mathbb{E}_p[X] $$

Assume we have a function $f$ that maps a continuous random variable $X$ into a set of numbers $D_f$ . Then $f(X)$ is a continuous random variable, too, which we write $F$. The expected value of this random variable is:

$$ \mathbb{E}[f]=\int_{x\in D}f(x)p(x)dx=\text{the expectation of }f $$

  • $\mathbb{E}[0]=0$
  • for any constant $k$, $\mathbb{E}[kf]=k\mathbb{E}[f]$
  • $\mathbb{E}[f+g]=\mathbb{E}[f]+\mathbb{E}[g]$
  • expectations are linear
  • the mean (expected value) of a random variable $X$ is $\mathbb{E}[X]$

variance of random variable:

$$ var[X]=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-(\mathbb{E}[X])^2 $$

  • for constant $k$, $var[k]=0$
  • $var[kX]=k^2var[X]$
  • if $X,Y$ are independent, then $var[X+Y]=var[X]+var[Y]$
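The definitions can be checked exactly on a small discrete case. The sketch below assumes a fair six-sided die and uses `Fraction` to avoid rounding; `expect` is an illustrative helper for $\mathbb{E}[f]$.

```python
from fractions import Fraction

D = range(1, 7)                     # values of a fair die
P = {x: Fraction(1, 6) for x in D}  # P(X = x)


def expect(f):
    """E[f] = sum over x in D of f(x) P(X = x)."""
    return sum(f(x) * P[x] for x in D)


mean = expect(lambda x: x)
var_def = expect(lambda x: (x - mean) ** 2)    # E[(X - E[X])^2]
var_alt = expect(lambda x: x * x) - mean ** 2  # E[X^2] - (E[X])^2

k = 3
mean_scaled = expect(lambda x: k * x)                        # k E[X]
var_scaled = expect(lambda x: (k * x - mean_scaled) ** 2)    # k^2 var[X]
var_const = expect(lambda x: (k - expect(lambda _: k)) ** 2)  # var[k]

print(mean, var_def, var_alt, mean_scaled, var_scaled, var_const)
```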

covariance:

$$ cov(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] $$

  • if $X,Y$ are independent, then $\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]$
  • if $X,Y$ are independent, then $cov(X,Y)=0$
  • $var[X]=cov(X,X)$
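An exact check of these properties, assuming two independent fair dice. The random variables `first`, `second` and `total` are made up for illustration: the two dice are independent of each other, while the total clearly depends on the first die.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # (first die, second die)
p = Fraction(1, len(outcomes))                   # each outcome equally likely


def expect(f):
    return sum(f(a, b) * p for a, b in outcomes)


def cov(f, g):
    """cov(F, G) = E[FG] - E[F]E[G]."""
    return expect(lambda a, b: f(a, b) * g(a, b)) - expect(f) * expect(g)


def first(a, b):
    return a


def second(a, b):
    return b


def total(a, b):
    return a + b


print(cov(first, second))  # independent variables
print(cov(first, first))   # equals var[first]
print(cov(first, total))   # dependent variables
```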

standard deviation of random variable:

$$ std(X)=\sqrt{var[X]} $$

Markov's inequality:


the probability of a random variable taking a particular value must fall off rather fast as that value moves away from the mean

$$ P(\{|X|\geq a\})\leq \frac{\mathbb{E}[|X|]}{a} $$

Chebyshev's inequality:

  • holds for any $k>0$, where $\sigma$ is the standard deviation of $X$
  • gives us the weak law of large numbers

$$ P(\{|X-\mathbb{E}[X]|\geq k\sigma\})\leq \frac{1}{k^2} $$
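Both inequalities can be checked on a concrete distribution. The sketch below assumes a fair six-sided die; the thresholds tried are arbitrary, and `prob`, `markov_holds` and `chebyshev_holds` are illustrative helper names.

```python
from fractions import Fraction
from math import sqrt

P = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die
mean = sum(x * p for x, p in P.items())
mean_abs = sum(abs(x) * p for x, p in P.items())
sigma = sqrt(sum((x - mean) ** 2 * p for x, p in P.items()))


def prob(condition):
    """Probability that the die value satisfies `condition`."""
    return sum(p for x, p in P.items() if condition(x))


def markov_holds(a):
    # P(|X| >= a) <= E[|X|] / a, for a > 0
    return prob(lambda x: abs(x) >= a) <= mean_abs / a


def chebyshev_holds(k):
    # P(|X - E[X]| >= k sigma) <= 1 / k^2, for k > 0
    return prob(lambda x: abs(x - mean) >= k * sigma) <= 1 / k ** 2


print([markov_holds(a) for a in (1, 2, 3, 4, 5, 6, 7)])
print([chebyshev_holds(k) for k in (0.5, 1, 1.4, 2, 3)])
```

The bounds are loose for this distribution, which is typical: they hold for every random variable with the given mean or variance, so they cannot be tight for most of them.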

indicator function:

An indicator function for an event is a function that takes the value zero for values of $x$ where the event does not occur, and one where the event occurs. For the event $\varepsilon$, we write:

$$ \mathbb{I}_{[\varepsilon]}(x)=\begin{cases}1&\text{if }x\in\varepsilon\\0&\text{otherwise}\end{cases} $$

  • $\mathbb{E}_P[\mathbb{I}_{[\varepsilon]}]=P(\varepsilon)$



discrete uniform distribution:

e.g. fair die, fair coin flip

A random variable has the discrete uniform distribution if it takes each of $k$ values with the same probability $\frac1k$, and all other values with probability zero.

Bernoulli Random Variables:

e.g. biased coin toss

A Bernoulli random variable takes the value $1$ with probability $p$ and $0$ with probability $1-p$. This is a model for a coin toss, among other things.

  • $mean =p$
  • $variance = p(1-p)$

The Geometric Distribution:

e.g. we flip a biased coin until the first head appears; $X$ is the number of flips required to get that head

$$ P(\{X=n\})=(1-p)^{n-1}p $$

  • $mean=\frac1p$
  • $variance=\frac{1-p}{p^2}$
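The mean and variance can be checked numerically by truncating the infinite sums. The value $p=0.25$ and the cut-off of 2000 terms are arbitrary choices for this sketch; the neglected tail is negligible at that point.

```python
p = 0.25  # assumed probability of heads


def pmf(n):
    """P(X = n): n - 1 tails followed by one head."""
    return (1 - p) ** (n - 1) * p


ns = range(1, 2001)  # truncated support; the true support is 1, 2, 3, ...
total = sum(pmf(n) for n in ns)
mean = sum(n * pmf(n) for n in ns)
variance = sum((n - mean) ** 2 * pmf(n) for n in ns)

print(total, mean, 1 / p)
print(variance, (1 - p) / p ** 2)
```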

The Binomial Probability Distribution:

e.g. toss a coin $N$ times; the probability that it comes up heads $h$ times

$$ P_b(h;N,p)=\binom{N}{h}p^h(1-p)^{N-h} $$

  • valid for $0\leq h\leq N$; for any other $h$ the probability is 0
  • $mean=Np$
  • $variance = Np(1-p)$
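A minimal sketch of the binomial probability, with a check that the probabilities sum to 1 and that the mean and variance match $Np$ and $Np(1-p)$. The values $N=10$, $p=0.3$ are arbitrary and `binomial_pmf` is an illustrative name.

```python
from math import comb


def binomial_pmf(h, N, p):
    """Probability of exactly h heads in N independent flips."""
    if not 0 <= h <= N:
        return 0.0
    return comb(N, h) * p ** h * (1 - p) ** (N - h)


N, p = 10, 0.3  # assumed example values
hs = range(N + 1)
total = sum(binomial_pmf(h, N, p) for h in hs)
mean = sum(h * binomial_pmf(h, N, p) for h in hs)
variance = sum((h - mean) ** 2 * binomial_pmf(h, N, p) for h in hs)

print(total, mean, N * p)
print(variance, N * p * (1 - p))
```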

Multinomial Probabilities:

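e.g. toss a die with $k$ sides $N$ times, where side $i$ comes up with probability $p_i$; the probability that side 1 comes up $n_1$ times, ..., and side $k$ comes up $n_k$ times, with $n_1+\ldots+n_k=N$, is

$$ P(n_1,\ldots,n_k)=\frac{N!}{n_1!n_2!\ldots n_k!}p_1^{n_1}p_2^{n_2}\ldots p_k^{n_k} $$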

The Poisson distribution:

e.g. the number of marketing phone calls you receive during the daytime

$$ P(\{X=k\})=\frac{\lambda^ke^{-\lambda}}{k!} $$

where $\lambda > 0$ is a parameter often known as the intensity of the distribution

  • $mean=\lambda$
  • $variance=\lambda$
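A numerical check that the mean and the variance both equal $\lambda$. The intensity $\lambda=3$ and the cut-off of 60 terms are arbitrary choices for this sketch; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with intensity lam."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 3.0       # assumed intensity
ks = range(60)  # truncated support; the tail beyond this is negligible here
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(total, mean, variance)
```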


Last modified: 2023-03-29 22:43