Data Visualisation

Visualisation: Histogram

Relative Frequency Histogram
Shows the location of the centre of the data
Shows the spread of the data

Skewness

Data skewed to the right (positively skewed) have a longer right tail

Dot plots

A graph of quantitative data where each data value is plotted as a dot above a horizontal scale of values.

Dots representing equal values are stacked.

Stem plots(Stem-and-Leaf plot)

Represents quantitative data by separating each value into two parts: the stem and the leaf

Time-Series Graph

A graph of time-series data

Pie Chart

Frequency Polygon

Pictograph

Representing data using images

Hints

Be careful w Nonzero Vertical Axis

Descriptive Stat

Measure of center: a value at the center/middle of a data set

Mean(Average)

Median: The value that cuts the data exactly in half

Mode: The value in the data set that occurs most frequently

Midrange:$\frac{Max+Min}{2}$, not resistant

Measures of Variation

Describe how spread out the point in a data set are.

Range

$Range = Max-Min$

Variance

A measure of how much the data values deviate from the mean.

Population consisting of N values

$\sigma ^2=\frac{(x_1-\mu)^2+\cdots+(x_n-\mu)^2}{N}=\frac{1}{N}\sum\limits_{i=1}^N(x_i-\mu)^2$

Sample consisting of n values

$s^2=\frac{(x_1-\overline{\mu})^2+\cdots+(x_n-\overline{\mu})^2}{n - 1}=\frac{1}{n-1}\sum\limits_{i=1}^n(x_i-\overline{x})^2$

never negative, zero only when all of the data values are exactly the same

Not resistant cuz its sensitive to extreme values

why use (n-1) to compute $s^2$

It makes $s^2$ an unbiased estimator of $\sigma^2$

Standard Deviation

Square root of the variance

Population standard deviation

$\sigma=\sqrt{\sigma^2}$

Sample standard deviation

$s=\sqrt{s^2}$

Standard Scores(z-scores)

The number of standard deviations that x lies above the mean

population

$z=\frac{x-\mu}{\sigma}$

sample

$z=\frac{x-\overline{x}}{s}$

if $x < \overline{x} \text{ or x} < \mu$ then $z < 0$

Percentiles

kth Percentile

A value $P_k$ such that k% of the values in the data set lie below it

Median: 50th percentile

first quartile $Q_1$: 25th percentile

third quartile $Q_3$: 75 percentile

IQR

$IQR=Q_3-Q_1$

Five-Number Summary

A concise summary of the data using five numbers:

Minimum, $Q_1$, Median, $Q_3$, Maximum

Boxplots

A line extending from the minimum value to the maximum value
A box with lines drawn at $Q_1$, the median and $Q_3$
A method for visulizing the five-number summary of a data set

Modifies Boxplots

Modify the original boxplot to show extreme observations or outliers

Find the quartiles $Q_1,Q_2,Q_3$
Compute $IQR=Q_3-Q_1$
In a modifies boxplot, a data value is considered an outlier

if it is greater than $Q_3+1.5 \cdot IQR$ or it is less than $Q_1-1.5 \cdot IQR$

A special symbol is used to identify outliers

The solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier

Probability

Law of Large Numbers

The relative frequency probability of an event tends to approach the actual probability.

Subjective Probability

Estimate the probability of an event by using knowledge of the relevant circumstances

Mutually Exclusive Events

A and B never occur at the same time

Repeatedly sample items from a population

Sampling with replacement
- Each time observe the item and put it back in the population
Sampling without replacement
- Each time observe it and remove it from the population

Bayes Theorem

$P(A|B)=\frac{P(B|A)P(A)}{P(B)}=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\overline{A})P(\overline{A})}$

Permutations: $P(n,r)=\frac{n!}{(n-r)!}$

Combinations:$C(n,r)=\frac{n!}{(n-r)!r!}$

Discrete Probability Distributions

Random Variable

A number with each outcome in the sample space S in a proabbility experiment.

$$X: S \rightarrow \mathbb{R}, i.e, \ \text{A function mapping outcomes in S to real numbers}$$

Discrete Random Variables

A random variable is discrete if it can take a finite or countably infinite number of different values

Continuous Random Variables

A random variable is continuous if it can take on an infinite number of different values that cannot be listed

Discrete Probability Distribution

Probability Mass Function

$p(x)=P(X=x)$

$\sum\limits_xp(x)=1$

Mean/Variabnce/Standard Deviation

Mean of X(Expected value of X)

$\mu=\sum\limits_xx\cdot p(x)$

Variance of X

$\sigma_x^2=\sum\limits_x[x^2\cdot p(x)]-\mu^2=\sum\limits_x(x-\mu )^2\cdot p(x)$

Standard Deviation

$\sigma_x=\sqrt{\sigma_x^2}$

Binomial Distribution

Binomial Random Variable

X = number of successes in the n trials

Probability Mass Function

A binomial experiment consisting of n trials with proabbility of success p

If X is the binomial random variable for thie experiment

$X$ ~ $Bin(n,p)$

$p(x)=P(X=x)=_nC_x p^x(1-p)^{n-x}$

Mean/Variance/Standard Deviation

$\mu_x=np$

$\sigma_x^2=np(1-p)$

$\sigma_x=\sqrt{np(1-p)}$

Poisson Distribution

$P(x)=\frac{\mu^x\cdot e^{-\mu}}{x!}$ $(x=0,1,2\cdots)$

$x$ ~ $Poiss(\mu)$

Assumption: Occurrences are:

Random
Independent of one another
Uniformly distributed over the interval being used

Mean

$\mu_x=\mu$

Variance

$\sigma_x^2=\mu$

Standard Deviation

$\sigma_x=\sqrt{\mu}$

Probability Density Function

$f(x)\geq 0$ for all x
The total area under the graph of f equals 1

Standard Normal Distribution

$f(z)=\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}$

$-\infty < z < \infty$

Normal Distribution

A continous random variable X has a normal distribution with parameters $-\infty < \mu < \infty$ and $\sigma^2 >0$ if it has a probability density function of the form

$f(x)=\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}},-\infty < x < \infty$

A standard normal distribution: $\mu = 0$, $\sigma=1$, $Z$~ $N(0,1)$

Mean/Variance/Standard Deviation

$X$~ $N(\mu,\sigma^2)$

$\mu_x=\mu$, $\sigma_x^2=\sigma^2$, $\sigma_x=\sigma$

Standardise

Z-Score

$Z=\frac{x-\mu}{\sigma}$

$Z$~ $N(0,1)$

$z_{0.05}=1.645$, 95% confidence level

Sampling Distribution

$\overline{X}$ ~ $N(\mu, \frac{\sigma^2}{n})$

if Sample is normal, sample mean is also normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$

n increases, the standard deviation $\frac{\sigma}{\sqrt{N}}$ of $\overline{X}$ gets smaller

Larger samples yield less variability in sample mean

Central Limit Theorem

$X_1,X_2,\cdots,X_n$ : a random sample from a distribution with mean $\mu$ and standard deviation $\sigma$

If n is sufficiently large, $\overline{X}$ has approximately a $N(\mu,\frac{\sigma^2}{n})$ distribution

$X_1 + X_2 + \cdots + X_n = n \overline{X}$ also approximately normal

Point and Interval Estimation

Point Estimate

A point estimate $\hat{\theta}$ of a parameter $\theta$ is a single number that can be regarded as a sensible value for $\theta$

Unbiased Estimator

A point estimator $\hat{\theta}$ of a parameter $\theta$ is unbiased if $E(\hat{\theta})=\theta$

MVUE: Minimum variance unbiased estimator

Theorem: Let $X_1, X_2, \cdots, X_n$ be a random sample from a normal distribution with Mean $\mu$ and variance $\sigma^2$. Then the estimator $\hat{\mu}=\overline{X}$ is the MVUE for $\mu$

Interval Estimation

Estimate a range of possible values for our parameter

Since $X_1, X_2, \cdots, X_n$ are drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$, $\overline{X}$~ $N(\mu,\frac{\sigma}{n})$

Standardize $\overline{X}$ like $Z=\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}$

We have Z ~ N(0,1)

Common Critical Values

Common Confidence Level Critical Value

0.90 1.645

0.95 1.96

0.99 2.575

Normal Confidence Interval for $\mu$($\sigma$ known)

Let $X_1, X_2, \cdots, X_n$ be a random sample from a normal distribution with unknown mean $\mu$ and known standard deviation $\sigma$

For Any $0 \leq \alpha \leq 1$, a $100(1-\alpha)%$ confidence interval for $\mu$ is:

$\overline{X}-z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}} < \mu < \overline{X} + z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}}$

$P(\overline{X}-z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}} < \mu < \overline{X} + z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}})=1-\alpha$

Normal Confidence Interval for $\mu$($\sigma$ unknown)

Let $X_1, X_2, \cdots, X_n$ be a random sample from a normal distribution with unknown mean $\mu$ and known standard deviation $\sigma$

Use sample standard deviation S to estimate the population stadard deviation $\sigma$

$T=\frac{\overline{X}-\mu}{S/ \sqrt{n}}$

T has t distribution with (n-1) degrees of freedom(df)

t distribution

One single parameter: degrees of freedom v (in our case, v = n - 1)

The distribution is symmetric

Mean is zero.

As df v gets really large, t distribution looks more like a standard normal distribution.

If T has such a distribution,

$t_{\alpha, n - 1}$ is the value such that $P(T>t_{\alpha, n-1})=\alpha$

$P(-t_{\frac{\alpha}{2},n-1}<T<t_{\frac{\alpha}{2},n-1}) = 1 - \alpha$

$P(-t_{\frac{\alpha}{2},n-1}<\frac{\overline{X}-\mu}{S/ \sqrt{n}}<t_{\frac{\alpha}{2},n-1}) = 1 - \alpha$

$P(\overline{X}-t_{\frac{\alpha}{2},n-1}\cdot \frac{S}{\sqrt{n}} < \mu < \overline{X}+t_{\frac{\alpha}{2},n-1}\cdot \frac{S}{\sqrt{n}})=1-\alpha$

$(0 \leq \alpha \leq 1, 100(1-\alpha)$% $\text{confidence interval for }\mu)$

Large Sample Confidence Interval for $\mu$

Let $X_1, X_2, \cdots, X_n$ be a random sample from a normal distribution with unknown mean $\mu$ and known standard deviation $\sigma$(not necessarily a normal distribution)

For any $0 \leq \alpha \leq 1$, an approximate $100(1-\alpha)$ % confidence interval for $\mu$ is as follows

$\overline{X}-z_{\frac{\alpha}{2}}\cdot \frac{S}{\sqrt{n}} < \mu < \overline{X}+z_{\frac{\alpha}{2}}\cdot \frac{S}{\sqrt{n}}$

Control the interval width

suppose normally distributed with $\sigma$

$\overline{X}-z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}} < \mu < \overline{X}+z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}}$

width of this confidence interval is $w=2z_{\frac{\alpha}{2} }\cdot \frac{\sigma}{\sqrt{n}}$

$n=(\frac{2z_{\frac{\alpha}{2}}\sigma}{w})^2$

Confidence Interval for p

A binomial experiment with n trials and an unknown probability of success p. Let X be the number of successes in the n trials.

natural point estimator for p is $\hat{p} =\frac{X}{N}$ i.e, the proportion of success in the n trials

if both $np \geq 10$ and $n(1-p) \geq 10$,then $\hat{p}$ approximately normally distributed by the Central Limit Theorem

$\hat{p} \approx N(p, \frac{p(1-p)}{n})$

Standardising $\hat{p}$

$Z=\frac{\hat{p}-p}{\sqrt{p(1-p)/n}} \approx N(0,1)$

$P(-z_{\frac{\alpha}{2}}< \frac{\hat{p}-p}{\sqrt{p(1-p)/n}} < z_{\frac{\alpha}{2}}) \approx 1-\alpha$

$0 \leq \alpha \leq 1$, appr $100(1-\alpha)$ % CI for p

$\frac{\hat{p}+z_{\alpha / 2}^2 / 2 n}{1+z_{\alpha / 2}^2 / n} \pm z_{\alpha / 2} \frac{\sqrt{\hat{p}(1-\hat{p}) / n+z_{\alpha / 2}^2 / 4 n^2}}{1+z_{\alpha / 2}^2 / n}$

Hypothesis Testing

Decision	H₀ is True	H₀ is False
Fail to Reject H₀	Correct Decision	Type II Error
Reject H₀	Type I Error	Correct Decision

$\alpha=P$(Type I Error)

$\beta=P$(Type || Error)

$\beta$ depends on the true value of the parameter we’re testing

Controlling $\beta$ can get complicated, so we can control $\alpha$

A hypothesis test with a probability $\alpha$ of making a Type I error is known as a level $\alpha$ test

Tests of $H_0:\mu=\mu_0$

Consider a sample $X_1, X_2, \cdots, X_n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Suppose $\sigma$ is known

Two-tailed test of $H_0:\mu=\mu_0$

A leval $\alpha$ test of $H_0$: $\mu=\mu_0$ vs $H_1:\mu \neq \mu_0$ is as follows

Reject $H_0$ if $Z < -z_{\alpha/2}$ or $Z > z_{\alpha/2}$

The test statistic:

$Z=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$

H1	Reject $H_0$ if
$\mu \neq \mu_0$	$z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
$\mu < \mu_0$	$z < -z_{\alpha}$
$\mu > \mu_0$	$z > z_{\alpha}$

Inferences from Two Samples

Comparing Independent Populations

$H_0:\mu1-\mu2 = 0,H_1: \mu1-\mu2 \neq 0$

$\overline{X}-\overline{Y}$ is an unbiased estimator of $\mu1-\mu2$

Since populations are normal $\overline{X}-\overline{Y}$ has a normal distribution

Assuming that $H_0$ is true, then $\mu1-\mu2=0$

$\overline{X}-\overline{Y}$ ~ $N(0,\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n})$

Standardize

$Z=\frac{\overline{X}-\overline{Y}}{\sqrt{\sigma_1^2/m+\sigma_2^2/n}}$~N(0,1)

Reject $H_0$ if z < $-z_{\alpha/2}$ or $z > z_{\alpha/2}$

Generalise to different $H_1$

H1	Reject $H_0$ if
$\mu_1 - \mu_2 \neq 0$	$z$ < $-z_{\alpha/2}$ or $z > z_{\alpha/2}$
$\mu_1-\mu_2 < 0$	$z < -z_{\alpha}$
$\mu_1-\mu_2 > 0$	$z > z_{\alpha}$

Confidence Interval for $\mu_1-\mu_2$

construct a $100(1-\alpha)$% CI for $\mu_1-\mu_2$

$\overline{X}-\overline{Y}$~ $N(\mu_1-\mu_2, \frac{\sigma^2}{m}+\frac{\sigma^2}{n})$

Standardize

$Z=\frac{\overline{X}-\overline{Y}-(\mu_1-\mu_2)}{\sqrt{\sigma_1^2/m+\sigma_2^2/n}}$~N(0,1)

$P(-z_{\alpha/2}<\frac{\overline{X}-\overline{Y}-(\mu_1-\mu_2)}{\sqrt{\sigma_1^2/m+\sigma_2^2/n}}<z_{\alpha/2})=1-\alpha$

CI: $\overline{X}-\overline{Y} \pm z_{\alpha/2} \cdot \sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}$

Two-Sample t Test

Under $H_0$, test statistic

$T = \frac{\overline{X}-\overline{Y}}{\sqrt{S_1^2/m+S_2^2/n}} \approx t_{\nu}$

Degree of freedom $\nu$, round down to the nearest whole number

$\nu =\frac{(s_1^2/m+s_2^2/n)^2}{\frac{(s_1^2/m)^2}{m-1}+\frac{(s_2^2/n)^2}{n-1}}$

H1	Reject $H_0$ if
$\mu_1 - \mu_2 \neq 0$	$t < -t_{\alpha/2,\nu}$ or $t > t_{\alpha/2,\nu}$
$\mu_1-\mu_2 < 0$	$t < -t_{\alpha,\nu}$
$μ1−μ2>0$	$t > t_{\alpha,\nu}$

Pooled t Test

Under the assumption $\sigma_1=\sigma_2$

Pooled estimator of $\sigma^2$

$S_p^2 = \frac{m - 1}{m + n - 2} S_1^2 + \frac{n - 1}{m + n - 2} S_2^2$

$\nu = m + n - 2$

$\bar{X}-\bar{Y} \pm t_{\alpha/2,\nu} \cdot\sqrt{\frac{S_1^2}{m}+\frac{S_2^2}{n}}$

Paired t Test

data of pairs: $(X_1,Y_1), (X_2, Y_2),\cdots,(X_n,Y_n)$

Differences: $D_1=Y_1-X_1, D_2=Y_2-X_2, \cdots, D_n=Y_n-X_n$

test $H_0:\mu_D=0$ against $H_1:\mu_D \neq 0, H_1:\mu_D>0,H_1:\mu_D<0$

Construct level $\alpha$ tests

test statistic $T=\frac{\bar{D}}{S_D/ \sqrt{n}}$~ $t_{n-1}$

H1	Reject $H_0$ if
$\mu_D \neq 0$	$t < -t_{\alpha/2,n-1}$ or $t > t_{\alpha/2,n-1}$
$\mu_D< 0$	$t < -t_{\alpha,n-1}$
$\mu_D>0$	$t > t_{\alpha,n-1}$

Confidence Interval for $\mu_D$

$P(-t_{\alpha/2,n-1}<\frac{\bar{D}}{S_D/ \sqrt{n}}<t_{\alpha/2,n-1})=1-\alpha$

CI: $\bar{D} \pm t_{\alpha/2,n-1} \cdot \frac{S_D}{\sqrt{n}}$

Comparing Population Proportions

Proportion of successes for the two population: p1, p2

Test the hypotheses

$H_0:p1-p2 = 0$ against $H_1: p1-p2 \neq 0,H_1:p1-p2>0, H_1:p1-p2<0$

x: the number of success in the sample from the first population

y: the number of success in the sample from the second population

then x and y each have binomial distribution

Use sample proportions $\hat{p_1}=\frac{x}{m}$ and $\hat{p_2}=\frac{y}{n}$ to develop a test statistic

suppose $H_0$ is true, then p1=p2=p

If $np \geq 10, n(1-p) \geq 10$

then the following test statistic has an approximate normal distribution

(Application of the Central Limit Theorem)

$z = \frac{\hat{p_1}-\hat{p_2}}{\sqrt{p(1-p)(\frac{1}{m}+\frac{1}{n}})}$~N(0,1)

Since we are assuming p1=p2=p(i.e, $H_0$ is true), we can use data from both samples to estimate the common p

The natural estimator of p is

$\hat{p}=\frac{x+y}{m+n}=\frac{m}{m+n} \cdot \hat{p_1} + \frac{n}{m+n} \cdot \hat{p_2}$

substitute $\hat{p}$ for $p$

$z = \frac{\hat{p_1}-\hat{p_2}}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{m}+\frac{1}{n})}}$ ~N(0, 1)

Must Assume: $m\hat{p_1} \geq 10, m(1-\hat{p_1}) \geq 10, n\hat{p_2} \geq 10, n(1-\hat{p_2}) \geq 10$

H1	Reject $H_0$ if
$p_1 - p_2 \neq 0$	$z$ < $-z_{\alpha/2}$ or $z > z_{\alpha/2}$
$p_1-p_2 < 0$	$z < -z_{\alpha}$
$p_1-p_2 > 0$	$z > z_{\alpha}$

Correlation

Covariance

positive relationship

$S_{xy}=\frac{1}{n} \sum\limits_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})>0$

Analogously, negative relationship betweeen $x_i$ and $y_i$, covariance $S_{xy}$ Negative

Pearson Correlation coefficient

$r=\frac{S_{xy}}{\sqrt{S_{xx}\cdot S_{yy}}}=\frac{\sum\limits_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum\limits_{i=1}^n(x_i-\bar{x})^2\cdot \sum\limits_{i=1}^n(y_i-\bar{y})^2}}$

r close to 1 indicate a strong positive linear relationship;

r close to -1 indicate a strong negative linear relationship

r = 0 indicates the lack of a linear relationship between x and y

Tests of $H_0: \rho=0$

$H_1: \rho \neq 0, H_1: \rho < 0, H_1: \rho > 0$

$T = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$~ $t_{n-2}$

H1	Reject $H_0$ if
$\rho \neq 0$	$t < -t_{\alpha/2,n-2}$ or $t > t_{\alpha/2,n-2}$
$\rho< 0$	$t < -t_{\alpha,n-2}$
$\rho>0$	$t > t_{\alpha,n-2}$

Linear Regression

Simple Linear Regression

There are parameters $\beta_0, \beta_1, \sigma^2$ such that for any fixed value of the independent variable x, the dependent variable is related to x through the model equation

$Y - \beta_0+\beta_1x+\epsilon$

Assume the error quantity $\epsilon$~ $N(0,\sigma^2)$

Least Squares Estimates

Find values of $b_0$ and $b_1$ that minimize $\sum_{i=1}^n(y_i-(b_0+b_1x_i))^2$

$b_0-\hat{\beta_0}$ and $b_1-\hat{\beta_1}$ that minimize this quantity are the least squares estimates of $\beta_0$ and $\beta_1$

In the case of simple linear regression

$\hat{\beta_1}=r \cdot \frac{S_y}{S_x}$, $\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}$

$\hat{\beta_1}$ has the same sign as the correlation coefficient r

$\hat{y}=\hat{\beta_0}+\hat{\beta_1}x$

Assessing Model Fit

residuals

$y_i-\hat{y_i}$

Prediction Interval

A range of values used to estimate a variable

A range of plausible values for the corresponding value of y:

$\hat{y} \pm t_{\alpha/2,n-2}s_e \sqrt{1+\frac{1}{n}+\frac{n(x-\bar{x})^2}{n(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)^2}}$

$s_e=\sqrt{\frac{\sum_{i=1}^n(y_i-\bar{y})}{n-2}}$

Relies on the assumption that the errors $\epsilon$ are normally distributed with mean 0 and a shared standard deviation $\sigma$

Total Variation

$\sum_{i=1}^n(y_i-\bar{y})^2$

Decompose:

$\sum_{i=1}^n(y_i-\bar{y})^2=\sum_{i=1}^n(\hat{y_i}-\bar{y})^2+\sum_{i=1}^n(y_i-\hat{y})^2$

total variation = explained variation + unexplained variation

Coefficient of determination

$R^2=\frac{\text{explained variation}}{\text{total variation}}$

$0 \leq R^2 \leq 1$ Ideally $R^2$ close to 1

In the case of simple linear regression $R^2=r^2$

Multiple Linear Regression

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$

Error term $\epsilon$ ~ $N(0, \sigma^2)$

Find a good model

Use domain knowledge to determine which likely useful variables
Plot the residuals, which should look like random noise
Create a normal Q-Q plot of the residuals to access. (Straight line)
Examine $R_{adj}^2$
Conduct a hypothesis test of $H_0:\beta_1 = \beta_2 = \cdots =\beta_k = 0$

Adjusted Coefficient of Determination

n: sample size k: number of independent variables

$R_{adj}^2 = 1 - \frac{n-1}{n-(k+1)}(1-R^2)$

Formal Test of Model Utility

Assuming the residuals appear to be drawn from a normal distribution

Hypotheses:

$H_0: \beta_1 = \beta_2=\cdots=\beta_k = 0$

$H_1: \text{at least one } \beta_i \neq 0$

F-distribution

Dummy variables

numerical encoding of a qualitative variable

Goodness-of-Fit

Pearson’s Chi-Squared Test

Test at level $\alpha$

$H_0: p_1 = p_{10}, p_2=p_{20}, \cdots, p_k = p_{k0}$

$H_1: \text{at least one } p_i \text{ not equal to } p_{i0}$

Check $np_{0i} \geq 5$

Test statistic: $\chi ^2 = \sum\limits_{i = 1}^k \frac{(n_i - np_{i0})^2}{np_{i0}}$

Reject $H_0$ if $\chi^2 > \chi_{\alpha, k-1}^2$

Caesar Ciphers

To encrypt text

shift the alphabet the the left by a fixed amount

Contingency Tables

Test of Homogeneity

Test at level $\alpha$

$H_0: p_{1j}=p_{2j}=\cdots = p_{Ij}$

$H_1: H_0 \text{ is not true}$

Check that $\hat{e_{ij}} \geq 5$ for each i and j

test statistic

$\chi^2 = \sum\limits_{i=1}^I \sum\limits_{j=1}^J \frac{(n_{ij} - \hat{e_{ij}})^2}{\hat{e_{ij}}}$

Reject $H_0$ at level $\alpha$ if $\chi ^2 > \chi_{\alpha, (I-1)(J-1)}^2$

Test of Independence

Test at level $\alpha$

$H_0: p_{ij} = p_i \cdot p_j$

$H_1: H_0 \text{ is false}$

Check that $\hat{e_{ij}} \geq 5$ for each i and j

test statistic

$\chi^2 = \sum\limits_{i=1}^I \sum\limits_{j=1}^J \frac{(n_{ij} - \hat{e_{ij}})^2}{\hat{e_{ij}}}$

Reject $H_0$ at level $\alpha$ if $\chi ^2 > \chi_{\alpha, (I-1)(J-1)}^2$