Machine Learning

Linear Regression

Input space: $x$

Output space: $y \in \mathbb{R}$

Model space: $f: x \rightarrow y$

$f(x) = b + w_1x_1 + w_2x_2 + \cdots + w_nx_n = b + w \cdot x$
$Loss(f) = \sum\limits_{i = 1}^N(f(x^i) - y^i)^2$

Fundamental problem of ML: $\mathop{\min}\limits_{f} Loss(f) = \mathop{\min}\limits_{b, w}\sum\limits_{i = 1}^N(f(x^i) - y^i)^2$

$Loss(w, b) = \sum\limits_{i=1}^{N}(wx^i + b - y^i)^2$
$\frac{\partial L}{\partial b} = \sum\limits_{i=1}^N(wx^i + b - y^i) = 0$
  • $w\sum\limits_i x^i + bN - \sum\limits_{i}y^i = 0$
$\frac{\partial L}{\partial w} = \sum\limits_{i=1}^N x^i(wx^i + b - y^i) = 0$
  • $w\sum\limits_{i} (x^i)^2 + b\sum\limits_{i}x^i - \sum\limits_{i}x^iy^i = 0$
$\bar x = \frac{1}{N}\sum x^i, \quad \bar y = \frac{1}{N}\sum y^i, \quad b = \bar y - w \bar x$
  • $w \bar x + b - \bar y = 0$
  • $w\overline{x^2} + b \bar x - \overline{xy} = 0$
$w\overline{x^2} + (\bar y - w \bar x)\bar x - \overline{xy} = 0$
$w\overline{x^2} + \bar x \bar y - w (\bar x)^2 - \overline{xy} = 0$
$w(\overline{x^2} - (\bar x)^2) = \overline{xy} - \bar x \bar y$
$w^* = \frac{\overline{xy} - \bar x \bar y}{\overline{x^2} - (\bar x)^2}$
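
A minimal NumPy sketch of this closed-form solution (the synthetic data, noise level, and variable names such as `w_star` are illustrative assumptions, not from the notes):

```python
import numpy as np

# Synthetic 1-D data: y ≈ 2x + 1 plus noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

# Closed-form least-squares estimates:
#   w* = (mean(xy) - mean(x) mean(y)) / (mean(x^2) - mean(x)^2)
#   b* = mean(y) - w* mean(x)
w_star = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
b_star = y.mean() - w_star * x.mean()

print(w_star, b_star)  # should come out close to 2 and 1
```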

Polynomial Regression

Still Linear

Treating $x, x^2, \cdots$ as distinct independent variables
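
A sketch of this idea under assumed choices (degree 3, `np.vander` for the feature columns, `np.linalg.lstsq` as the solver); the model stays linear in the weights even though the features are powers of $x$:

```python
import numpy as np

# Degree-3 polynomial regression treated as ordinary linear regression
# over the feature columns [1, x, x^2, x^3].
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 0.5 * x ** 3 - x + 2 + rng.normal(0, 0.1, size=200)

degree = 3
X = np.vander(x, N=degree + 1, increasing=True)  # columns: x^0, x^1, x^2, x^3

# Least squares over the expanded features: still linear in the weights.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # roughly [2, -1, 0, 0.5]
```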

Evaluation

  1. MAE: Mean Absolute Error
  • $MAE = \frac{1}{N}\sum\limits_{i=1}^N|f(x^i) - y^i|$
  2. MSE: Mean Squared Error
  • $MSE = \frac{1}{N}\sum\limits_{i=1}^N(f(x^i) - y^i)^2$
  3. RMSE: Root Mean Squared Error
  • $RMSE = \sqrt{MSE}$
  4. MAPE: Mean Absolute Percentage Error
  • $MAPE = \frac{1}{N}\sum\limits_{i=1}^N\frac{|f(x^i) - y^i|}{y^i}$
  5. $R^2$ Score
  • $R^2 = 1 - \frac{\sum\limits_{i=1}^N(f(x^i) - y^i)^2}{\sum\limits_{i=1}^N(y^i - \bar y)^2}$
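
A sketch of these regression metrics in NumPy (the function name and example numbers are illustrative; MAPE here assumes positive, nonzero targets):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MAPE and R^2 for 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err) / y_true)  # as in the notes; assumes positive targets
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

print(regression_metrics([3.0, 5.0, 2.5], [2.8, 5.3, 2.4]))
```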

Optimization

Objective function: $F: x \rightarrow \mathbb{R}$

Find $x$ to maximize or minimize $F$

A local optimum occurs where $F'(x) = 0$

$\frac{\partial{F}}{\partial{x_i}} = 0 \rightarrow \text{solve for } x^* = (x_1^*, x_2^*, \cdots, x_n^*)$
$\nabla_{x} F = 0$

To find a local optimum numerically, iterate:

$x_0: \text{some guess}$
$x_{t+1} = x_t - \alpha_t F'(x_t)$
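
A minimal 1-D sketch of this iteration; the objective $F(x) = (x - 3)^2$ and the step size are assumptions for illustration:

```python
# Minimize F(x) = (x - 3)^2, whose derivative is F'(x) = 2(x - 3).
x = 0.0        # some guess
alpha = 0.1    # assumed step size

for t in range(100):
    x = x - alpha * 2 * (x - 3)   # x_{t+1} = x_t - alpha * F'(x_t)

print(x)  # converges toward the minimizer x* = 3
```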

Gradient Descent

$x(t+1) = x(t) - \alpha\nabla F(x(t))$
  • Stochastic Gradient Descent (SGD)
    • Calculate the gradient using a small random subset of the observations
    • $w(0): \text{some guess}, \quad b(0): \text{some guess}$
    • $x(t+1) = x(t) - \alpha\left[\frac{1}{k}\sum\limits_{j=1}^{k}\nabla F_{i_j}(x(t))\right]$, $k: \text{batch size}$
    • At update $t+1$:
      • pick $i$ at random
      • $w(t+1) = w(t) - \alpha \nabla_w Loss_i(w(t), b(t))$
      • $b(t+1) = b(t) - \alpha \frac{\partial Loss_i}{\partial b}(w(t), b(t))$
$\frac{\partial Loss_i}{\partial w_k} = \frac{\partial}{\partial w_k}\left[(b + w \cdot x^i - y^i)^2\right] = 2x_k^i(b + w \cdot x^i - y^i)$
$\frac{\partial Loss_i}{\partial b} = \frac{\partial}{\partial b}\left[(b + w \cdot x^i - y^i)^2\right] = 2(b + w \cdot x^i - y^i)$
$\nabla_w Loss_i = 2(b + w \cdot x^i - y^i)\, x^i$
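
A sketch of SGD (batch size 1) for the 1-D linear model using these per-sample gradients; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
y = 3 * x - 2 + rng.normal(0, 0.3, size=500)

w, b = 0.0, 0.0    # some guess
alpha = 0.01       # assumed learning rate

for t in range(10_000):
    i = rng.integers(len(x))          # pick i at random
    residual = b + w * x[i] - y[i]
    grad_w = 2 * residual * x[i]      # dLoss_i / dw
    grad_b = 2 * residual             # dLoss_i / db
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)  # should approach 3 and -2
```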

Logistic Regression

Input: $x$

Output: $y \in \{0, 1\}$

Model: $F(x) = P(x \text{ is in class } 1)$

$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$
$F(x) = \text{sigmoid}(b + w_1x_1 + w_2x_2 + \cdots + w_nx_n) = \frac{1}{1 + e^{-(b + w \cdot x)}}$
Maximize the log-likelihood: $\mathop{\max}\limits_{F} \sum\limits_{i=1}^{N} \ln\left[{F(x^i)}^{y^i}\,{(1 - F(x^i))}^{(1 - y^i)}\right]$
Equivalently, minimize the cross-entropy loss: $\mathop{\min}\limits_{F} \sum\limits_{i=1}^{N} \left[-y^i \ln F(x^i) - (1 - y^i)\ln(1 - F(x^i))\right]$
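
A sketch of the sigmoid model and this cross-entropy loss; the clipping constant and example data are assumptions added for numerical safety and illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, X, y):
    """Sum over samples of -y ln F(x) - (1 - y) ln(1 - F(x))."""
    p = sigmoid(X @ w + b)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Tiny example: 4 samples with 2 features each (illustrative numbers)
X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, 0.3], [2.0, 2.0]])
y = np.array([1, 1, 0, 1])
w = np.zeros(2)
print(logistic_loss(w, 0.0, X, y))  # equals 4 * ln(2) ≈ 2.77 at w = 0, b = 0
```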

For Classification

Input: $x$

Output: $y \in \{-1, +1\}$

$\text{Softmax}(v_1, v_2, \cdots, v_k)_j = \frac{e^{v_j}}{\sum\limits_{l=1}^{k} e^{v_l}}$

Softmax Regression

$F(x) = \text{Softmax}(b_1 + w^1 \cdot x, b_2 + w^2 \cdot x, \cdots, b_c + w^c \cdot x)$

Loss: Cross entropy loss
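
A sketch of softmax regression's forward pass and the cross-entropy loss; the weight shapes and example numbers are assumptions for illustration:

```python
import numpy as np

def softmax(v):
    """Row-wise softmax of a score matrix of shape (N, C)."""
    v = v - v.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(v)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean of -ln p(correct class) over the samples."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# 3 classes, 2 features, 4 samples (illustrative)
W = np.array([[1.0, -1.0], [0.0, 0.5], [-0.5, 1.0]])  # one weight row w^c per class
b = np.zeros(3)
X = np.random.default_rng(0).normal(size=(4, 2))
y = np.array([0, 2, 1, 0])

probs = softmax(X @ W.T + b)  # F(x) = Softmax(b_1 + w^1·x, ..., b_c + w^c·x)
print(cross_entropy(probs, y))
```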

Metrics

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall} = \frac{TP}{TP + FN}$
$F_{\beta}\text{ score} = (1 + \beta^2)\,\frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$
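
A sketch that computes these metrics from 0/1 predictions (the function and variable names are illustrative; it assumes at least one predicted and one actual positive so the denominators are nonzero):

```python
import numpy as np

def classification_metrics(y_true, y_pred, beta=1.0):
    """Accuracy, precision, recall and F_beta for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return accuracy, precision, recall, f_beta

print(classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
```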

Deep Learning

Simple Models

  • Linear Regression

    • $f(x) = w \cdot x + b$
  • Perceptron [hard classifier]

    • $f(x) = \sigma(w \cdot x + b)$
  • Logistic Regression

    • $f(x) = \text{sigmoid}(w \cdot x + b)$
  • Multiclass Classification

    • $f(x) = \text{Softmax}(w \cdot x + b)$

Complex Model (Neural Networks)

$f(x) = \sigma(w_3\,\sigma(w_2\,\sigma(w_1 x + b_1) + b_2) + b_3)$
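
A sketch of this three-layer composition as a forward pass; the layer widths, random initialization, and the ReLU/sigmoid choices are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# f(x) = sigma(w3 sigma(w2 sigma(w1 x + b1) + b2) + b3), layer sizes 2 -> 8 -> 8 -> 1
W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)

def forward(x):
    h1 = relu(W1 @ x + b1)        # first hidden layer
    h2 = relu(W2 @ h1 + b2)       # second hidden layer
    return sigmoid(W3 @ h2 + b3)  # output layer

print(forward(np.array([0.5, -1.0])))
```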

Activation Function

$\sigma(w_1 x + b_1)$

Examples:

  • $\sigma(z) = \tanh(z)$
  • $\sigma(z) = \text{sigmoid}(z)$
  • $\sigma(z) = \text{ReLU}(z)$
  • $\sigma(z) = \text{Leaky ReLU}(z)$
  • $\sigma(z) = \text{Softmax}(z)$

Activation functions introduce non-linearity into the model.
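
A sketch of the listed activation functions (the leaky-ReLU slope of 0.01 is a common but assumed default):

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def softmax(z):
    e = np.exp(z - np.max(z))  # stabilize before exponentiating
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
for f in (tanh, sigmoid, relu, leaky_relu, softmax):
    print(f.__name__, f(z))
```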