### Describing data

mean: $\frac1N\sum^{N}_{i=1}x_i$

p27
• scaling: $mean(\{kx\})=k\,mean(\{x\})$
• translating: $mean(\{x+c\})=mean(\{x\})+c$
• $\sum^N_{i=1}(x_i-mean(\{x\}))=0$
• the sum of squared distances from the data points to a point $\mu$ is minimized when $\mu=mean(\{x\})$
• strongly affected by outliers
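
The scaling, translation, and zero-deviation properties are easy to verify numerically; a minimal sketch (the dataset and the constants `k`, `c` are arbitrary illustrative choices):

```python
# Sanity check of the mean's scaling/translation properties.
# The dataset and the constants k, c are made up for illustration.

def mean(xs):
    return sum(xs) / len(xs)

x = [2.0, 4.0, 7.0, 11.0]
k, c = 3.0, 5.0
m = mean(x)

assert mean([k * v for v in x]) == k * m      # mean(kx) = k*mean(x)
assert mean([v + c for v in x]) == m + c      # mean(x+c) = mean(x)+c
assert abs(sum(v - m for v in x)) < 1e-12     # deviations sum to zero
```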

standard deviation: $std(\{x\})=\sqrt{\frac1N\sum^{N}_{i=1}(x_i-mean(\{x\}))^2}$

p29
• when std is small, most data points lie close to the mean
• translating: $std(\{x+c\})=std(\{x\})$
• scaling: $std(\{kx\})=|k|\,std(\{x\})$
• at most $\frac{N}{k^2}$ data points lie $k$ or more standard deviations away from the mean
• there must be at least one data item that is at least one standard deviation away from the mean
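
A quick numerical sketch of the standard deviation and the data bound above (the dataset, with one large outlier, is made up):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((v - m) ** 2 for v in xs) / len(xs))

x = [1.0, 2.0, 2.0, 3.0, 100.0]   # made-up data with one large outlier
m, s = mean(x), std(x)

# At most N/k^2 points lie k or more standard deviations from the mean,
# i.e. the fraction of such points is at most 1/k^2.
k = 2.0
frac = sum(1 for v in x if abs(v - m) >= k * s) / len(x)
assert frac <= 1 / k ** 2
```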

variance: $var(\{x\})=\frac1N\sum^{N}_{i=1}(x_i-mean(\{x\}))^2$

p31
• translating: $var(\{x+c\})=var(\{x\})$
• $var[k]=0$, where $k$ is a constant
• $var(\{kx\})=k^2var(\{x\})$

median: the middle value of the sorted data; like the mean, it summarizes where the data lies, but it is less affected by outliers

• scaling: $median(\{kx\})=k\,median(\{x\})$
• translating: $median(\{x+c\})=median(\{x\})+c$

interquartile range:

p34

The interquartile range of a dataset $\{x\}$ is $iqr(\{x\})=percentile(\{x\},75)-percentile(\{x\},25)$

• estimates how spread out the data is, and is robust to outliers
• scaling: $iqr(\{kx\})=|k|\,iqr(\{x\})$
• translating: $iqr(\{x+c\})=iqr(\{x\})$
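
The IQR properties can be checked with the standard library's quartile estimator (the example data is arbitrary):

```python
from statistics import quantiles  # stdlib quartile estimator

def iqr(xs):
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1

x = [3.0, 1.0, 7.0, 2.0, 9.0, 4.0, 5.0]   # arbitrary example data
k, c = 2.0, 10.0

assert iqr([k * v for v in x]) == k * iqr(x)   # scaling scales the iqr
assert iqr([v + c for v in x]) == iqr(x)       # translation leaves it unchanged
```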

### graph

histogram:

p35
• bar chart vs histogram: a bar chart is for categorical data, while a histogram is for quantitative data
• uni/multimodal: a unimodal histogram has one peak, a bimodal one has two, a multimodal one has several
• skew: a histogram can be symmetric, left skewed, or right skewed; left skewed means the tail is long on the left

box plot:

A box plot is a way to plot data that simplifies comparison

outlier: a data item larger than $q_3+1.5(q_3-q_1)$ or smaller than $q_1-1.5(q_3-q_1)$, where $q_1$ and $q_3$ are the first and third quartiles

whisker: lines extending to the most extreme non-outlier data points
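
The box-plot outlier rule above is straightforward to implement; a sketch on made-up data:

```python
from statistics import quantiles

def box_plot_outliers(xs):
    """Return the points outside [q1 - 1.5*iqr, q3 + 1.5*iqr]."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in xs if v < lo or v > hi]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 40.0]   # made-up data; 40.0 is far out
print(box_plot_outliers(x))                 # [40.0]
```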

### standardized coordinate

p37

coordinates in which the data have been normalized:

$$\hat {x_i}=\frac{x_i-mean(\{x\})}{std(\{x\})}$$

• the mean of data in standard coordinates is 0
• the standard deviation of data in standard coordinates is 1
• for many kinds of data, histograms in standard coordinates look much the same; the common shape is the standard normal curve, given by:

$$y(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$

• data whose histogram in standard coordinates follows this curve is called standard normal data

correlation:

$$corr(\{(x,y)\})=\frac{\sum_i\hat{x_i}\hat{y_i}}{N}$$

• ranges from -1 to 1; the larger the absolute value, the better $\hat x$ predicts $\hat y$
• the sign indicates positive/negative correlation
• 0 means no correlation, 1 means $\hat {y_i}=\hat {x_i}$, and -1 means $\hat {y_i}=-\hat {x_i}$
• $corr(\{(x,y)\})=corr(\{(y,x)\})$
• the value of the correlation coefficient is not changed by translating the data
• scaling the data can change the sign (when the scale factor is negative), but not the absolute value

prediction:

p62
1. transform the data set into standard coordinates
2. compute the correlation $r$
3. predict $\hat {y_0}=r\hat{x_0}$
4. transform the prediction back into the original coordinates
• Rule of Thumb: the predicted value of $y$ goes up by $r$ standard deviations when the value of $x$ goes up by one standard deviation
• the root mean square error of this prediction, in standard coordinates, is $\sqrt{1-r^2}$
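
The four-step procedure can be sketched end to end; the toy data below is perfectly linear, so $r=1$ and the prediction is exact:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((v - m) ** 2 for v in xs) / len(xs))

def standardize(xs):
    m, s = mean(xs), std(xs)
    return [(v - m) / s for v in xs]

def corr(xs, ys):
    xh, yh = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(xh, yh)) / len(xs)

def predict_y(xs, ys, x0):
    """Predict y at x0: standardize x0, apply y_hat = r * x_hat,
    then transform back into the original coordinates."""
    r = corr(xs, ys)
    x_hat = (x0 - mean(xs)) / std(xs)
    return r * x_hat * std(ys) + mean(ys)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # perfectly linear toy data, so r == 1

assert abs(corr(xs, ys) - 1.0) < 1e-12
assert abs(predict_y(xs, ys, 2.5) - 5.0) < 1e-12
```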

### probability

p70

outcome: a possible result of the experiment; every run of the experiment produces exactly one of the set of possible outcomes

sample space: the set of all outcomes, which we usually write $\Omega$

event: event is a set of outcomes

• $P(\Omega)=1$
• $P(\emptyset)=0$
• denote $\{A_i\}$ a collection of disjoint events, that is, $A_i\cap A_j=\emptyset$ for $i\not = j$; then:

$$P(\cup_iA_i)=\sum_iP(A_i)$$

combination:

p74

the number of ways to select $k$ items from $N$, regardless of order:

$$\binom{N}{k}=\frac{N!}{k!(N-k)!}$$
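
Python's standard library exposes this directly as `math.comb`; a few identities worth checking:

```python
import math

# math.comb(N, k) computes N! / (k! * (N - k)!)
assert math.comb(5, 2) == 10
assert math.comb(5, 2) == math.comb(5, 3)                 # C(N,k) = C(N,N-k)
assert sum(math.comb(4, k) for k in range(5)) == 2 ** 4   # rows sum to 2^N
```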

probability calculating:

$$P(A)+P(A^c)=1\\ P(A-B)=P(A)-P(A\cap B)\\ P(A\cup B) =P(A)+P(B)-P(A\cap B)$$

### Conditional probability

P84

the probability that $B$ occurs given that $A$ has definitely occurred. We write this as $P(B|A)$

$$P(B|A)=\frac{P(B\cap A)}{P(A)}=\frac{P(A|B)P(B)}{P(A)}$$

• $P(A)=P(A|B)P(B)+P(A|B^c)P(B^c)$
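
A worked numerical example of the total-probability identity and Bayes' rule; all three rates below are hypothetical numbers invented for illustration:

```python
from fractions import Fraction as F

# Hypothetical diagnostic test: B = "has condition", A = "test is positive".
p_b = F(1, 100)              # P(B), made-up prior
p_a_given_b = F(9, 10)       # P(A|B), made-up true-positive rate
p_a_given_not_b = F(5, 100)  # P(A|B^c), made-up false-positive rate

# total probability: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes: P(B|A) = P(A|B)P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)   # 2/13, i.e. about 0.15
```

Exact fractions make it easy to see that a positive test still leaves $P(B|A)$ small when the prior $P(B)$ is small.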

independent:

Two events A and B are independent if and only if $P(A\cap B)=P(A)P(B)$

Equivalently, if two events are independent (and have nonzero probability), then $P(A|B)=P(A)$ and $P(B|A)=P(B)$.

• pairwise independent: each pair of events in the list is independent; pairwise independence does not imply mutual independence
• conditional independent: $P(A_1\cap ...\cap A_n|B)=P(A_1|B)...P(A_n|B)$

### Random variables

P103

Given a sample space $\Omega$, a set of events $F$, a probability function $P$, and a countable set of real numbers $D$, a discrete random variable is a function with domain $\Omega$ and range $D$.

probability distribution function: $P(\{X=x\})$

cumulative distribution function: $P(\{X\leq x\})$

joint probability function: $P(\{X=x\}\cap\{Y=y\})=P(x,y)$

Bayes' Rule:

$$P(x|y)=\frac{P(y|x)P(x)}{P(y)}$$

independent random variables: $P(x,y)=P(x)P(y)$

#### probability density function

P107

Let $p(x)$ be a probability density function (often called a pdf or density) for a continuous random variable $X$. We interpret this function by thinking in terms of small intervals. Assume that $dx$ is an infinitesimally small interval. Then: $p(x)dx=P(\{X\in[x,x+dx]\})$

• non-negative: $p(x)\geq 0$
• $\int^{\infty}_{-\infty}p(x)dx=1$

normalizing constant: a non-negative function $g(x)$ can be turned into a density by dividing it by $\int^{\infty}_{-\infty}g(x)dx$
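
As a numerical check that a properly normalized density integrates to 1, a Riemann-sum sketch using the standard normal curve from earlier (the grid width and integration range are arbitrary choices):

```python
import math

def normal_pdf(x):
    # standard normal density; 1/sqrt(2*pi) is its normalizing constant
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Riemann sum over [-8, 8]; the tail mass beyond that is negligible
dx = 0.001
total = sum(normal_pdf(-8 + i * dx) * dx for i in range(int(16 / dx)))
assert abs(total - 1.0) < 1e-6
```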

### Expected Values

P110

Given a discrete random variable $X$ which takes values in the set $D$ and which has probability distribution $P$, we define the expected value:

$$\mathbb{E}[X]=\sum_{x\in D}xP(X=x)=\mathbb{E}_p[X]$$

for a continuous random variable $X$ which takes values in the set $D$ and which has probability density function $p$, we define the expected value as:

$$\mathbb{E}[X]=\int_{x\in D}xp(x)dx=\mathbb{E}_p[X]$$

Assume we have a function $f$ that maps a continuous random variable $X$ into a set of numbers $D_f$ . Then $f(X)$ is a continuous random variable, too, which we write $F$. The expected value of this random variable is:

$$\mathbb{E}[f]=\int_{x\in D}f(x)p(x)dx=\text{the expectation of }f$$

• $\mathbb{E}[0]=0$
• for any constant $k$, $\mathbb{E}[kf]=k\mathbb{E}[f]$
• $\mathbb{E}[f+g]=\mathbb{E}[f]+\mathbb{E}[g]$
• expectation are linear
• the mean or expected value of a random variable $X$ is $\mathbb{E}[X]$

variance of random variable:

$$var[X]=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-(\mathbb{E}[X])^2$$

• for constant $k$, $var[k]=0$
• $var[kX]=k^2var[X]$
• if $X,Y$ are independent, then $var[X+Y]=var[X]+var[Y]$

covariance:

$$cov(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]$$

• if $X,Y$ are independent, then $\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]$
• if $X,Y$ are independent, then $cov(X,Y)=0$
• $var[X]=cov(X,X)$

standard deviation of random variable:

$$std[X]=\sqrt{var[X]}$$

Markov's inequality:

P116

a random variable is unlikely to take values whose magnitude is much larger than $\mathbb{E}[|X|]$

$$P(\{|X|\geq a\})\leq \frac{\mathbb{E}[|X|]}{a}$$

Chebyshev's inequality:

• gives us the weak law of large numbers

$$P(\{|X-\mathbb{E}[X]|\geq k\sigma\})\leq \frac{1}{k^2}$$
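
An empirical sketch of Chebyshev's inequality: simulate sums of ten fair-die rolls (mean $35$, variance $10\cdot\frac{35}{12}$ by linearity and independence) and compare the observed tail fraction to the bound. The sample size and seed are arbitrary choices.

```python
import math
import random

random.seed(0)

# X = sum of 10 fair-die rolls; variance of one roll is 35/12.
mu = 10 * 3.5
sigma = math.sqrt(10 * 35 / 12)

samples = [sum(random.randint(1, 6) for _ in range(10)) for _ in range(10_000)]

k = 2.0
frac = sum(1 for s in samples if abs(s - mu) >= k * sigma) / len(samples)
assert frac <= 1 / k ** 2   # Chebyshev's bound; the observed fraction is far smaller
```

The bound $1/k^2=0.25$ is loose here: the true tail probability is roughly $0.05$, which illustrates that Chebyshev holds for every distribution at the cost of being conservative.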

indicator function:

An indicator function for an event is a function that takes the value zero for values of $x$ where the event does not occur, and one where the event occurs. For the event $E$, we write $\mathbb{I}_{[E]}(x)$.

• $\mathbb{E}_P[\mathbb{I}_{[E]}]=P(E)$

### Distribution

P131

discrete uniform distribution:

e.g. fair die, fair coin flip

A random variable has the discrete uniform distribution if it takes each of $k$ values with the same probability $\frac1k$, and all other values with probability zero.

Bernoulli Random Variables:

e.g. biased coin toss

Bernoulli random variable takes the value $1$ with probability $p$ and $0$ with probability $1-p$. This is a model for a coin toss, among other things

• $mean =p$
• $variance = p(1-p)$

The Geometric Distribution:

e.g. flip a biased coin until the first head appears; $X$ is the number of flips required

$$P(\{X=n\})=(1-p)^{n-1}p$$

• $mean=\frac1p$
• $variance=\frac{1-p}{p^2}$
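
A truncated-sum check of the geometric pmf and its mean (the value of $p$ is an arbitrary choice; the cutoff at 500 makes the neglected tail vanishingly small):

```python
# Truncated-sum check of the geometric distribution.
p = 0.3   # arbitrary head probability

def geom_pmf(n):
    return (1 - p) ** (n - 1) * p

total = sum(geom_pmf(n) for n in range(1, 500))
mu = sum(n * geom_pmf(n) for n in range(1, 500))

assert abs(total - 1.0) < 1e-9   # the pmf sums to 1
assert abs(mu - 1 / p) < 1e-9    # mean = 1/p
```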

The Binomial Probability Distribution:

e.g. toss a biased coin $N$ times; the probability that it comes up heads $h$ times

$$P_b(h;N,p)=\binom{N}{h}p^h(1-p)^{N-h}$$

• valid for $0\leq h\leq N$; in any other case the probability is equal to 0
• $mean=Np$
• $variance = Np(1-p)$
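
The binomial pmf, mean, and variance can be verified by direct summation (the values of $N$ and $p$ are arbitrary):

```python
import math

def binom_pmf(h, N, p):
    # P_b(h; N, p) = C(N, h) * p^h * (1-p)^(N-h)
    return math.comb(N, h) * p ** h * (1 - p) ** (N - h)

N, p = 10, 0.4   # arbitrary choices for illustration
pmf = [binom_pmf(h, N, p) for h in range(N + 1)]

assert abs(sum(pmf) - 1.0) < 1e-12
mu = sum(h * q for h, q in enumerate(pmf))
var = sum((h - mu) ** 2 * q for h, q in enumerate(pmf))
assert abs(mu - N * p) < 1e-9              # mean = Np
assert abs(var - N * p * (1 - p)) < 1e-9   # variance = Np(1-p)
```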

Multinomial Probabilities:

e.g. roll a die with $k$ sides $N$ times; the probability that side $i$ comes up exactly $n_i$ times (with $n_1+...+n_k=N$) is

$$P(n_1,...,n_k;N,p_1,...,p_k)=\frac{N!}{n_1!...n_k!}p_1^{n_1}...p_k^{n_k}$$

The Poisson distribution:

e.g. the number of marketing phone calls you receive during a day

$$P(\{X=k\})=\frac{\lambda^ke^{-\lambda}}{k!}$$

where $\lambda > 0$ is a parameter often known as the intensity of the distribution

• $mean=\lambda$
• $variance=\lambda$
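
A truncated-sum check that the Poisson mean and variance both equal $\lambda$ (the intensity is an arbitrary choice; the sum is cut off where the tail is negligible):

```python
import math

def poisson_pmf(k, lam):
    # P({X = k}) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0   # arbitrary intensity
pmf = [poisson_pmf(k, lam) for k in range(100)]   # tail beyond 100 is negligible

mu = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mu) ** 2 * q for k, q in enumerate(pmf))

assert abs(sum(pmf) - 1.0) < 1e-9
assert abs(mu - lam) < 1e-9    # mean = lambda
assert abs(var - lam) < 1e-9   # variance = lambda
```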