Theory

Hyperplane:

$$w^Tx+b=0$$
  • $w$: weight coefficient vector
  • $b$: intercept
  • $x$: sample point
  • $y = +1$: positive sample; $y = -1$: negative sample

Functional Margin

$$\hat{\gamma}^{i} = y^i(w^Tx^i+b)$$

Geometric Margin

The actual distance from the sample point to the hyperplane

$x^i$: sample point A
$\gamma^i$: modulus (length) of the vector $\overrightarrow{BA}$
B: $x^i-\gamma^i \cdot \frac{w}{||w||}$
Since B lies on the hyperplane,
$$w^T\left(x^i-\gamma^i \cdot \frac{w}{||w||}\right)+b=0$$

Hence, when $y_A=1$, the geometric margin is

$$\gamma^i=\frac{w^Tx^i+b}{||w||}=\left(\frac{w}{||w||}\right)^{T}x^i+\frac{b}{||w||}$$
More generally,
$$\gamma^i=y^i\left(\left(\frac{w}{||w||}\right)^{T}x^i+\frac{b}{||w||}\right)$$
The functional margin and the geometric margin are related by
$$\gamma=\frac{\hat{\gamma}}{||w||}$$
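As a quick illustration, here is a minimal numpy sketch (with made-up values for $w$, $b$ and a few labelled sample points) that computes both margins directly from the definitions above:

```python
import numpy as np

# Hypothetical hyperplane parameters and labelled samples (illustrative values only)
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0],
              [0.0, 3.0],
              [-2.0, 1.0]])
y = np.array([1, -1, -1])

functional_margin = y * (X @ w + b)                        # hat{gamma}^i = y^i (w^T x^i + b)
geometric_margin = functional_margin / np.linalg.norm(w)   # gamma^i = hat{gamma}^i / ||w||

print(functional_margin)  # positive values => correctly classified points
print(geometric_margin)   # signed distances to the hyperplane
```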

Kernels

Instead of applying the SVM directly to the original input attributes $x$, we may want to train on some features $\phi(x)$.
This can be achieved by revising our prior algorithm and substituting every occurrence of $x$ with $\phi(x)$.
Since the algorithm is expressed solely in terms of inner products $\langle x,z\rangle$,
we can replace those inner products with $\langle\phi(x),\phi(z)\rangle$.
In particular, given a feature mapping $\phi$, we define the corresponding kernel as

$$K(x, z) = \phi(x)^{T}\phi(z)$$
Everywhere we previously had $\langle x, z\rangle$ in our algorithm,
we can simply substitute $K(x,z)$, and the algorithm is then trained using the features $\phi$.

Now, let’s talk about a slightly different view of kernels.
If $\phi(x)$ and $\phi(z)$ are close together, we might expect $K(x,z) = \phi(x)^{T}\phi(z)$ to be large.
Conversely, if $\phi(x)$ and $\phi(z)$ are far apart (say, nearly orthogonal to each other), then $K(x, z) = \phi(x)^{T}\phi(z)$ will be small.
So we can think of $K(x, z)$ as a measure of how similar $\phi(x)$ and $\phi(z)$ are, or of how similar $x$ and $z$ are.

$$K(x,z)=\exp\left(\frac{-||x-z||^2}{2\sigma^2}\right)=\exp\left(\frac{-||x||^2}{2\sigma^2}\right)\exp\left(\frac{-||z||^2}{2\sigma^2}\right)\exp\left(\frac{\langle x, z\rangle}{\sigma^2}\right)$$
where the Taylor expansion of the third factor is
$$\exp\left(\frac{\langle x, z \rangle}{\sigma^2}\right)=1+\frac{\langle x,z\rangle}{\sigma^2}+\frac{\langle x,z\rangle^2}{2\sigma^4}+\frac{\langle x,z\rangle^3}{3!\sigma^6}+\cdots=\sum\limits_{n=0}^{\infty}\frac{\langle x,z\rangle^{n}}{n!\,\sigma^{2n}}$$

Because this Taylor expansion contains inner-product terms of every order, the Gaussian kernel, also called the Radial Basis Function (RBF) kernel, implicitly maps the input from a low-dimensional space into an infinite-dimensional feature space.
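The factorization above is easy to verify numerically. Below is a small sketch (using randomly drawn vectors and an arbitrary $\sigma$, both chosen only for illustration) that checks the two sides of the identity agree:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

lhs = np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))
rhs = (np.exp(-np.linalg.norm(x) ** 2 / (2 * sigma ** 2))
       * np.exp(-np.linalg.norm(z) ** 2 / (2 * sigma ** 2))
       * np.exp(np.dot(x, z) / sigma ** 2))

# ||x - z||^2 = ||x||^2 + ||z||^2 - 2<x, z>, so the two expressions coincide
print(np.isclose(lhs, rhs))  # True
```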

Common Kernel Functions

Linear Kernel

$$K(x,z)=\langle x,z \rangle$$

Polynomial Kernel

$$K(x,z)=(\gamma\langle x,z \rangle + r)^d$$

Gaussian Kernel

$$K(x,z)=\exp(-\gamma||x-z||^2)$$

Sigmoid Kernel

$$K(x,z)=\tanh(\gamma\langle x,z \rangle + r)$$
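For reference, the four kernels above translate directly into numpy one-liners. This is only a sketch; the parameter names `gamma`, `r` and `d` are chosen here to mirror the formulas and their default values are arbitrary:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * np.dot(x, z) + r) ** d

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def sigmoid_kernel(x, z, gamma=0.01, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)
```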

Lagrange Multipliers

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.
The basic idea is to convert a constrained problem into a form to which the derivative test of an unconstrained problem can still be applied.
Thus, for the conditional extrema of a multivariate function $f(x,y,z,\cdots)$ under multiple constraints $\phi(x,y,\cdots) = 0,\ \psi(x,y,\cdots) = 0,\ \cdots$,
the steps of the Lagrange multiplier method can be summarized as follows (a small worked example follows the steps):

  1. Form the Lagrangian function

    $$F(x,y,z,\cdots,\lambda,\mu,\cdots)=f(x,y,z,\cdots)+ \lambda\phi(x,y,\cdots)+\mu\psi(x,y,\cdots)+\cdots$$

  2. Find the stationary points of the multivariate function $F(x,y,z,\cdots,\lambda,\mu,\cdots)$
    Solve the system of equations

$$\left\{ \begin{array}{l} F_x=0 \\ F_y=0 \\ \cdots \\ F_{\lambda}=0\\ \cdots \end{array} \right.$$

    to find the stationary point $(x_0,y_0,z_0,\cdots,\lambda_0,\mu_0,\cdots)$;
    then $f(x_0,y_0,z_0,\cdots)$ is a possible conditional extreme value.
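As a small worked example of these steps (a sketch using sympy; the objective $f(x,y)=x^2+y^2$ and the constraint $x+y-1=0$ are chosen only for illustration):

```python
from sympy import symbols, diff, solve

x, y, lam = symbols('x y lambda')

# Step 1: form the Lagrangian F = f + lambda * phi
f = x**2 + y**2          # objective
phi = x + y - 1          # equality constraint phi(x, y) = 0
F = f + lam * phi

# Step 2: solve F_x = F_y = F_lambda = 0 for the stationary point
stationary = solve([diff(F, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)                # [{x: 1/2, y: 1/2, lambda: -1}]
print(f.subs(stationary[0]))     # 1/2, the conditional extremum
```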

Generalized Lagrangian

$$L(w,\alpha,\beta)=f(w)+\sum\limits_{i=1}^k\alpha_ig_i(w)+\sum\limits_{i=1}^l\beta_ih_i(w)$$
Here the $\alpha_i$'s and $\beta_i$'s are the Lagrange multipliers. Consider the quantity
$$\theta_{P}(w)= \max \limits_{\alpha,\beta : \alpha_i \geq 0} L(w,\alpha,\beta)$$

Here the subscript P stands for “primal”.
The primal constraints are $g_i(w) \leq 0$ and $h_i(w)=0$.
If the constraints are satisfied for a particular value of $w$, then

$$\max\limits_{\alpha,\beta:\alpha_i \geq 0}\left(\sum\limits_{i=1}^{k}\alpha_ig_i(w)+\sum\limits_{i=1}^{l}\beta_{i} h_i(w)\right)=0$$

which means

$$\theta_P(w)=f(w)+0=f(w)$$

If the constraints are not satisfied,

$$\theta_P(w)=\infty$$

Dual Optimization Problem

Define

$$\theta_D(\alpha,\beta)=\min\limits_{w}L(w,\alpha,\beta)$$
Here D stands for “dual”. We can now pose the dual optimization problem:
$$d^{*}=\max\limits_{\alpha,\beta:\alpha_i \geq 0}\theta_{D}(\alpha,\beta)=\max\limits_{\alpha,\beta:\alpha_i \geq 0}\min\limits_{w}L(w,\alpha,\beta)$$

The primal and the dual problems are related as follows:

$$d^{*}=\max\limits_{\alpha,\beta: \alpha_i \geq 0}\min\limits_{w}L(w,\alpha,\beta) \leq \min\limits_{w} \max\limits_{\alpha,\beta:\alpha_i \geq 0}L(w,\alpha,\beta)=p^{*}$$
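This weak-duality inequality $d^* \leq p^*$ can be seen numerically on a toy Lagrangian. The sketch below uses the made-up problem $\min_w w^2$ subject to $1 - w \leq 0$, whose Lagrangian is $L(w,\alpha) = w^2 + \alpha(1 - w)$, and evaluates both sides of the inequality on a grid:

```python
import numpy as np

# Toy problem: min w^2  s.t.  1 - w <= 0, with Lagrangian L(w, a) = w^2 + a*(1 - w)
w = np.linspace(-3.0, 3.0, 601)
a = np.linspace(0.0, 4.0, 401)           # multipliers constrained to a >= 0
W, A = np.meshgrid(w, a)                  # rows index a, columns index w
L = W**2 + A * (1.0 - W)

d_star = L.min(axis=1).max()              # max over a of min over w  (dual)
p_star = L.max(axis=0).min()              # min over w of max over a  (primal, on the bounded grid)

print(d_star, p_star, d_star <= p_star + 1e-9)
# Both values are approximately 1 here; for this convex problem strong duality holds.
```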

Karush-Kuhn-Tucker (KKT) Conditions

Suppose $f(w)$ and the $g_i(w)$'s are convex, and the $h_i$'s are affine.
Suppose further that the constraints $g_i$ are strictly feasible;
this means that there exists some $w$ so that $g_i(w) < 0$ for all $i$.
Under these assumptions, there must exist $w^*, \alpha^*, \beta^*$ so that $w^*$ is the solution to the primal problem
and $\alpha^*,\beta^*$ are the solution to the dual problem.
Furthermore, $p^* = d^* = L(w^*, \alpha^*, \beta^*)$.

Moreover, $w^*, \alpha^*$ and $\beta^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:
$$\begin{aligned} \frac{\partial}{\partial w_i} \mathcal{L}\left(w^*, \alpha^*, \beta^*\right) & =0, \quad i=1, \ldots, d \\ \frac{\partial}{\partial \beta_i} \mathcal{L}\left(w^*, \alpha^*, \beta^*\right) & =0, \quad i=1, \ldots, l \\ \alpha_i^* g_i\left(w^*\right) & =0, \quad i=1, \ldots, k \\ g_i\left(w^*\right) & \leq 0, \quad i=1, \ldots, k \\ \alpha_i^* & \geq 0, \quad i=1, \ldots, k\end{aligned}$$
The condition $\alpha_i^* g_i(w^*) = 0$ is called the KKT dual complementarity condition.

Specifically, it implies that if $\alpha_i^{*} > 0$, then $g_i(w^*) = 0$.
This will be key for showing that the SVM has only a small number of “support vectors”.

Optimization

Hard Margin

Find the minimum of $L$ with respect to $w$ and $b$ to obtain $W(\alpha)$

The optimization objective of the hard-margin SVM is

$$\min\limits_{w,b}\frac{1}{2}||w||^2$$
$$\text{s.t. } y^{(i)}(w^Tx^{(i)}+b) \geq 1, \quad i=1,\cdots,m$$
The generalized Lagrangian function is:
$$L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum\limits_{i=1}^m\alpha_i[y^{(i)}(w^{T}x^{(i)}+b)-1]$$

where $\alpha_i \geq 0$ are the Lagrange multipliers and

$$g_i(w)=-y^{(i)}(w^{T}x^{(i)}+b)+1 \leq 0$$

Then the original problem’s dual optimization problem is:

$$d^* = \max\limits_{\alpha:\alpha_i \geq 0}\min\limits_{w,b}L(w,b,\alpha)$$

To solve this, we take the partial derivatives of $L$ with respect to $w$ and $b$ and set them to zero:

$$\frac{\partial L}{\partial w}=w-\sum\limits_{i=1}^m \alpha_{i}y^{(i)}x^{(i)}=0$$
$$\frac{\partial L}{\partial b}=-\sum\limits_{i=1}^m\alpha_{i}y^{(i)}=0$$

Then we have

$$w=\sum\limits_{i=1}^m\alpha_i y^{(i)}x^{(i)}$$
$$W(\alpha)=\min\limits_{w,b}L(w,b,\alpha)=\sum\limits_{i=1}^{m}\alpha_i-\frac{1}{2}\sum\limits_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^{T}x^{(j)}$$

Solution Steps:

$$\begin{aligned} L(w, b, \alpha)&=\frac{1}{2} w^{T} w-\sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^{T} x^{(i)}+b\right)-1\right] \\ & =\frac{1}{2} w^{T} w-\sum_{i=1}^m \alpha_i y^{(i)} w^{T} x^{(i)}-\sum_{i=1}^m \alpha_i y^{(i)} b+\sum_{i=1}^{m} \alpha_i \\ & =\frac{1}{2} w^{T} w-w^{T} \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}-b \sum_{i=1}^m \alpha_i y^{(i)}+\sum_{i=1}^m \alpha_i \end{aligned}$$
$$\begin{aligned} W(\alpha) & =\frac{1}{2} w^{T} w-w^{T} w+\sum_{i=1}^m \alpha_i \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} \sum_{i, j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)}\left(x^{(i)}\right)^T x^{(j)} \end{aligned}$$

Hence

$$W(\alpha)=\min\limits_{w, b}\mathcal{L}(w, b, \alpha)=\sum\limits_{i=1}^m \alpha_i-\frac{1}{2} \sum\limits_{i, j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^T x^{(j)}$$

Find the maximum of $W(\alpha)$ with respect to $\alpha$

Consider the optimization problem:

$$\begin{aligned} & \max_{\alpha} W(\alpha)=\max_{\alpha} \sum_{i=1}^m \alpha_i-\frac{1}{2} \sum\limits_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^{T} x^{(j)} \\ & \text { s.t. } \alpha_i \geqslant 0, \quad i=1,2, \cdots, m\end{aligned}$$

together with the constraint:

$$\sum\limits_{i=1}^{m} \alpha_{i} y^{(i)}=0$$
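Concretely, this dual objective can be written in matrix form as $W(\alpha) = \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T Q \alpha$ with $Q_{ij} = y^{(i)}y^{(j)}(x^{(i)})^T x^{(j)}$. A small numpy sketch (the data and the feasible $\alpha$ below are toy values, for illustration only) that evaluates this objective and checks the equality constraint:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha, with Q_ij = y_i y_j <x_i, x_j>."""
    Q = (y[:, None] * X) @ (y[:, None] * X).T
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# Toy linearly separable data
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = np.array([0.2, 0.0, 0.1, 0.1])   # feasible: alpha_i >= 0 and sum(alpha_i * y_i) == 0
print(dual_objective(alpha, X, y))
print(np.isclose(alpha @ y, 0.0))        # equality constraint sum alpha_i y_i = 0
```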

Furthermore, suppose $\alpha^*=(\alpha_1^*,\alpha_2^*,\cdots, \alpha_m^*)^T$ is the solution of the dual optimization problem.
For the solutions of the dual optimization problem and the original optimization problem to coincide,
the following KKT conditions need to be satisfied:

$$\begin{aligned} & \frac{\partial}{\partial w} L\left(w^*, b^*, \alpha^*\right)=w^*-\sum_{i=1}^m \alpha_i^* y^{(i)} x^{(i)}=0 \\ & \frac{\partial}{\partial b} L\left(w^*, b^*, \alpha^*\right)=-\sum_{i=1}^m \alpha_i^* y^{(i)}=0 \\ & \alpha_i^* g_i\left(w^*\right)=\alpha_i^*\left[y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)-1\right]=0, \quad i=1,2, \cdots, m \\ & g_i\left(w^*\right)=-y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)+1 \leqslant 0, \quad i=1,2, \cdots, m \\ & \alpha_i^* \geqslant 0, \quad i=1,2, \cdots, m \end{aligned}$$
Therefore,
$$w^*=\sum\limits_{i=1}^m\alpha_i^*y^{(i)}x^{(i)}$$
Since $w^* \neq 0$, there exists at least one $\alpha_j^* > 0$.

If $\alpha_j^* > 0$, then by the complementarity condition we must have:

$$y^{(j)}(w^* \cdot x^{(j)}+b^*)=1$$
This shows that the sample point $x^{(j)}$ is a support vector, and that the hyperplane depends only on the support vectors.

Moreover, since $(y^{(j)})^2=1$, we obtain

$$b^* = y^{(j)}-\sum\limits_{i=1}^m\alpha_{i}^*y^{(i)}(x^{(i)} \cdot x^{(j)})$$
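In code, once the dual solution $\alpha^*$ is available (from any solver), $w^*$ and $b^*$ follow directly from the two formulas above. A hedged sketch, where `alphas`, `X` and `y` are assumed to come from an already solved hard-margin dual:

```python
import numpy as np

def recover_w_b(alphas, X, y, eps=1e-8):
    """Recover the primal solution from the dual: w* = sum_i alpha_i y_i x_i and
    b* = y_j - sum_i alpha_i y_i <x_i, x_j> for any support vector j (alpha_j > 0)."""
    w = (alphas * y) @ X                   # w* = sum_i alpha_i y_i x_i
    support = np.where(alphas > eps)[0]    # support vectors are the points with alpha_i > 0
    j = support[0]                         # any support vector works for b*
    b = y[j] - (alphas * y) @ (X @ X[j])   # b* = y_j - sum_i alpha_i y_i (x_i . x_j)
    return w, b, support
```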

Soft Margin

The optimization objective of the SVM soft margin is

$$\min\limits_{w,b}\frac{1}{2}||w||^2 + C\sum\limits_{i=1}^{m} \zeta_i$$
$$\text{s.t. } y^{(i)}(w^Tx^{(i)}+b)\geq 1- \zeta_i,\quad \zeta_i \geq 0, \quad i=1,2,\cdots,m$$

Then the generalized Lagrangian function is:

$$L(w, b, \zeta, \alpha, r)=\frac{1}{2}\|w\|^2+C \sum_{i=1}^m \zeta_i-\sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^{T} x^{(i)}+b\right)-1+\zeta_i\right]-\sum_{i=1}^m r_i \zeta_i$$

where $\alpha_i \geq 0$ and $r_i \geq 0$. Also note that

$$\left\{\begin{array}{l}g_i(w)=-y^{(i)}\left(w^{T} x^{(i)}+b\right)+1-\zeta_i \leqslant 0 \\ h_i(\zeta)=-\zeta_i \leqslant 0 ; \quad i=1,2, \cdots, m\end{array}\right.$$

The dual of this primal optimization problem is then:

$$d^*=\max\limits_{\alpha,r}\min\limits_{w,b,\zeta}L(w,b,\zeta,\alpha,r)$$

Find the minimum of $L$ with respect to $w$, $b$ and $\zeta$ to obtain $W(\alpha, r)$

$$\frac{\partial L}{\partial w}=w-\sum\limits_{i=1}^{m}\alpha_{i}y^{(i)}x^{(i)}=0$$
$$\begin{gathered}\frac{\partial L}{\partial b}=-\sum_{i=1}^m \alpha_i y^{(i)}=0 \\ \frac{\partial L}{\partial \zeta_i}=C-\alpha_i-r_i=0 \\ w=\sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} \\ \sum_{i=1}^m \alpha_i y^{(i)}=0 \\ C-\alpha_i-r_i=0\end{gathered}$$

Further,

$$\begin{aligned} W(\alpha, r) & =\frac{1}{2} w^{T} w+C \sum_{i=1}^m \zeta_i-w^{T} w+\sum_{i=1}^m \alpha_i-\sum_{i=1}^m \alpha_i \zeta_i-\sum_{i=1}^m r_i \zeta_i \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w+\sum_{i=1}^m \zeta_i\left(C-\alpha_i-r_i\right) \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w \end{aligned}$$

Find the maximum of $W(\alpha)$ with respect to $\alpha$

Maximizing over $\alpha$, we have:

$$\begin{aligned} & \max _\alpha W(\alpha)=\max _\alpha \sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w \\ & \text { s.t. } \sum_{i=1}^m \alpha_i y^{(i)}=0 \\ & C-\alpha_i-r_i=0, \quad i=1,2, \cdots, m \\ & \alpha_i \geqslant 0, \quad i=1,2, \cdots, m \\ & r_i \geqslant 0, \quad i=1,2, \cdots, m \end{aligned}$$

Combining these constraints gives:

$$0 \leq \alpha_i \leq C$$

Finally, the dual optimization problem is:

$$\begin{aligned} & \max _\alpha \sum_{i=1}^m \alpha_i-\frac{1}{2} \sum_{i, j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^{T} x^{(j)} \\ & \text { s.t. } 0 \leqslant \alpha_i \leqslant C, \quad i=1,2, \cdots, m \quad \text { and } \quad \sum_{i=1}^m \alpha_i y^{(i)}=0 \end{aligned}$$
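This box-constrained dual can be handed to a generic constrained optimizer. Below is a minimal sketch using scipy's SLSQP solver (the toy data and $C$ are arbitrary; in practice one would use SMO, derived in the next section, which is far more efficient):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (two points per class) and regularization parameter
X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
m = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):                              # minimize the negative dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual,
               x0=np.zeros(m),
               method='SLSQP',
               bounds=[(0.0, C)] * m,                               # 0 <= alpha_i <= C
               constraints={'type': 'eq', 'fun': lambda a: a @ y})  # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X
print(alpha, w)
```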

Now suppose $\alpha^*=(\alpha_1^*,\alpha_2^*,\cdots,\alpha_m^*)^T$ is the solution of the dual optimization problem.
For the solutions of the dual optimization problem and the original optimization problem to coincide,
the following KKT conditions need to be satisfied:

$$\begin{aligned} & \frac{\partial}{\partial w} L\left(w^*, b^*, \zeta^*, \alpha^*, r^*\right)=w^*-\sum\limits_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}=0 \\ & \frac{\partial}{\partial b} L\left(w^*, b^*, \zeta^*, \alpha^*, r^*\right)=-\sum\limits_{i=1}^{m} \alpha_i^* y^{(i)}=0 \\ & \frac{\partial}{\partial \zeta} L\left(w^*, b^*, \zeta^*, \alpha^*, r^*\right)=C-\alpha^*-r^*=0 \\ & \alpha_i^* g_i\left(w^*\right)=\alpha_i^*\left[y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)-1+\zeta_i^*\right]=0, \quad i=1,2, \cdots, m \\ & r_i^* h_i\left(\zeta^*\right)=r_i^* \zeta_i^*=0, \quad i=1,2, \cdots, m \\ & g_i\left(w^*\right)=-y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)+1-\zeta_i^* \leqslant 0, \quad i=1,2, \cdots, m \\ & h_i\left(\zeta^*\right)=-\zeta_i^* \leqslant 0, \quad i=1,2, \cdots, m \\ & \alpha_i^* \geqslant 0, \quad r_i^* \geqslant 0, \quad i=1,2, \cdots, m \end{aligned}$$

From the first equation we know that

$$w^*=\sum\limits_{i=1}^m\alpha_i^*y^{(i)}x^{(i)}$$

If there exists a component $\alpha_j^*$ with $0 < \alpha_j^* < C$,
then $C-\alpha_j^*-r_j^*=0$ gives $r_j^* > 0$,
and $r_j^* \zeta_j^* = 0$ then forces $\zeta_j^* = 0$;
so finally, at this point we know that there must be:

$$y^{(j)}(w^* \cdot x^{(j)}+b^*)=1$$

Further,

$$b^*=y^{(j)}-\sum\limits_{i=1}^{m}\alpha_i^*y^{(i)}(x^{(i)} \cdot x^{(j)})$$

SMO

Coordinate Ascent

Starting from an initial point, coordinate ascent moves along only one coordinate direction at a time, optimizing the objective function with respect to that single variable while all the others are held fixed.

The SMO algorithm simply does the following:
Repeat till convergence {

  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Reoptimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \neq i, j$) fixed.
    }
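To see what coordinate ascent does before adding the SVM-specific constraint handling, here is a toy sketch that maximizes a simple concave quadratic $W(\alpha) = p^T\alpha - \frac{1}{2}\alpha^T Q\alpha$ by repeatedly re-optimizing one coordinate at a time (the matrix and vector are made up for illustration):

```python
import numpy as np

# Concave toy objective W(a) = p^T a - 0.5 a^T Q a, maximized where Q a = p, i.e. a = [1, 1]
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])
p = np.array([3.0, 3.0])

a = np.zeros(2)
for sweep in range(20):
    for i in range(2):
        # Holding the other coordinate fixed, the 1-D maximizer is found by
        # setting dW/da_i = p_i - sum_j Q_ij a_j = 0 and solving for a_i.
        a[i] = (p[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a)   # converges to approximately [1.0, 1.0]
```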

Theory

The optimization problem to be solved ($K(\cdot,\cdot)$ is any kernel function) is:

$$\max\limits_{\alpha}\sum\limits_{i=1}^m\alpha_i-\frac{1}{2}\sum\limits_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_{j}K(x^{(i)}, x^{(j)})$$
$$\text{s.t. } 0 \leq \alpha_i \leq C,\quad i=1,2,\cdots,m \quad \text{ and } \quad \sum\limits_{i=1}^{m}\alpha_{i}y^{(i)}=0$$
Fixing all multipliers except $\alpha_1$ and $\alpha_2$, the two-variable subproblem becomes
$$\begin{aligned} & \max _{\alpha_1, \alpha_2} \alpha_1+\alpha_2-\frac{1}{2} \alpha_1^2 K_{11}-\frac{1}{2} \alpha_2^2 K_{22}-\alpha_1 \alpha_2 y^{(1)} y^{(2)} K_{12}-\alpha_1 y^{(1)} \sum_{i=3}^m \alpha_i y^{(i)} K_{1i}-\alpha_2 y^{(2)} \sum_{i=3}^m \alpha_i y^{(i)} K_{2 i}+\Psi_{\text {constant }} \\ & \text { s.t. } 0 \leqslant \alpha_i \leqslant C, \quad i=1,2 \\ & \alpha_1 y^{(1)}+\alpha_2 y^{(2)}=-\sum_{i=3}^m \alpha_i y^{(i)}=\zeta \end{aligned}$$

where $K_{ij}=K(x^{(i)}, x^{(j)})$ and $\Psi_{\text{constant}}$ is a constant that does not depend on $\alpha_1$ and $\alpha_2$.

Note that

$$\begin{aligned} & g(x)=\sum_{i=1}^m \alpha_i y^{(i)} K\left(x, x^{(i)}\right)+b \\ & v_i=\sum_{j=3}^m \alpha_j y^{(j)} K\left(x^{(i)}, x^{(j)}\right)=g\left(x^{(i)}\right)-\sum_{j=1}^2 \alpha_j y^{(j)} K\left(x^{(i)}, x^{(j)}\right)-b, \quad i=1,2 \end{aligned}$$

Then the objective function can be rewritten as:

$$W\left(\alpha_1, \alpha_2\right)= \alpha_1+\alpha_2-\frac{1}{2} \alpha_1^2 K_{11}-\frac{1}{2} \alpha_2^2 K_{22}-\alpha_1 \alpha_2 y^{(1)} y^{(2)} K_{12}-\alpha_1 y^{(1)} v_1 -\alpha_2 y^{(2)} v_2+\Psi_{\text {constant}}$$
Substituting $\alpha_1=\left(\zeta-\alpha_2 y^{(2)}\right) y^{(1)}$ into the above equation, we get an objective function of the single variable $\alpha_2$:
$$\begin{aligned} W\left(\alpha_2\right)= & \left(\zeta-\alpha_2 y^{(2)}\right) y^{(1)}+\alpha_2-\frac{1}{2}\left(\zeta-\alpha_2 y^{(2)}\right)^2 K_{11}- \\ & \left(\zeta-\alpha_2 y^{(2)}\right) \alpha_2 y^{(2)} K_{12}-\left(\zeta-\alpha_2 y^{(2)}\right) v_1-\frac{1}{2} \alpha_2^2 K_{22}-\alpha_2 y^{(2)} v_2 \end{aligned}$$

The derivative of the above expression with respect to $\alpha_2$ is

$$\begin{aligned} \frac{\partial W}{\partial \alpha_2}= & -y^{(1)} y^{(2)}+1+\zeta y^{(2)} K_{11}-\alpha_2 K_{11}+2 \alpha_2 K_{12}- \\ & \zeta y^{(2)} K_{12}+v_1 y^{(2)}-\alpha_2 K_{22}-y^{(2)} v_2 \end{aligned}$$

Setting this derivative to zero, we get

$$\alpha_2=\frac{y^{(2)}\left(y^{(2)}-y^{(1)}+\zeta K_{11}-\zeta K_{12}+v_1-v_2\right)}{K_{11}-2 K_{12}+K_{22}}$$

Note that

$$\begin{aligned} & \eta=K_{11}-2 K_{12}+K_{22} \\ & E_i=g\left(x^{(i)}\right)-y^{(i)}=\left(\sum_{j=1}^m \alpha_j y^{(j)} K\left(x^{(i)}, x^{(j)}\right)+b\right)-y^{(i)}, \quad i=1,2 \end{aligned}$$

After initializing $\alpha_1^{old}$ and $\alpha_2^{old}$,
substituting the definitions above and $\zeta=\alpha_1^{old}y^{(1)}+\alpha_2^{old}y^{(2)}$ into the expression for $\alpha_2$ gives

$$\begin{aligned} \alpha_2^{new}=\frac{y^{(2)}}{\eta}&\left[y^{(2)}-y^{(1)}+\left(\alpha_1^{old} y^{(1)}+\alpha_2^{old} y^{(2)}\right) K_{11}-\left(\alpha_1^{old} y^{(1)}+\alpha_2^{old} y^{(2)}\right) K_{12}\right. \\ &\left.+g\left(x^{(1)}\right)-\sum_{j=1}^2 \alpha_j^{old} y^{(j)} K_{1j}-b-g\left(x^{(2)}\right)+\sum_{j=1}^2 \alpha_j^{old} y^{(j)} K_{2 j}+b\right] \end{aligned}$$

which simplifies to

$$\alpha_2^{new}=\alpha_2^{old}+\frac{y^{(2)}\left(E_1-E_2\right)}{\eta}$$

Because of the constraints $0 \leq \alpha_1, \alpha_2 \leq C$ and $\alpha_1 y^{(1)}+\alpha_2 y^{(2)}=\zeta$, the pair $(\alpha_1, \alpha_2)$ can only lie on a line segment inside the box $[0,C]\times[0,C]$.
So the value of $\alpha_2^{new}$ must lie on the line containing that segment,
and it is also necessary to make $\alpha_2^{new}$ satisfy:

$$L \leq \alpha_2^{new} \leq H$$
where $L$ and $H$ are the two endpoints of the segment. If $y^{(1)} \neq y^{(2)}$, then
$$L=\max(0,\alpha_2^{old}-\alpha_1^{old}),\quad H=\min(C,C+\alpha_2^{old}-\alpha_{1}^{old})$$

If $y^{(1)} = y^{(2)}$, then

$$L=\max(0,\alpha_2^{old}+\alpha_1^{old}-C),\quad H=\min(C,\alpha_2^{old}+\alpha_{1}^{old})$$

Therefore,

$$\alpha_2^{\text{new,clipped}}= \begin{cases}H, & \alpha_2^{\text{new}}>H \\ \alpha_2^{\text{new}}, & L \leqslant \alpha_2^{\text{new}} \leqslant H \\ L, & \alpha_2^{\text{new}} < L\end{cases}$$

Further, with $\alpha_1=(\zeta-\alpha_{2}y^{(2)})y^{(1)}$ and $\zeta=\alpha_1^{old}y^{(1)}+\alpha_2^{old}y^{(2)}$, we get:

$$\begin{aligned} \alpha_1^{new}&=(\alpha_1^{old}y^{(1)}+\alpha_2^{old}y^{(2)}-\alpha_2^{new,clipped}y^{(2)})y^{(1)}\\ &=\alpha_1^{old}+y^{(1)}y^{(2)}(\alpha_2^{old}-\alpha_2^{new,clipped}) \end{aligned}$$

Solving for the bias $b$

When $0 < \alpha_1^{new} < C$, the first sample is a support vector, so

$$y^{(1)}\left(w^{T} x^{(1)}+b\right)=y^{(1)} g\left(x^{(1)}\right)=y^{(1)}\left(\sum_{i=1}^m \alpha_i y^{(i)} K_{1 i}+b\right)=1$$
$$\sum_{i=1}^m \alpha_i y^{(i)} K_{1 i}+b=y^{(1)}$$
$$b_1^{\text{new}}=y^{(1)}-\sum_{i=3}^m \alpha_i y^{(i)} K_{1 i}-\alpha_1^{\text{new}} y^{(1)} K_{11}-\alpha_2^{\text{new,clipped}} y^{(2)} K_{12}$$
$$E_1=\sum_{i=3}^m \alpha_i y^{(i)} K_{1 i}+\alpha_1^{\text{old}} y^{(1)} K_{11}+\alpha_2^{\text{old}} y^{(2)} K_{12}+b^{\text{old}}-y^{(1)}$$

The first two terms in the expression for $b_1^{\text{new}}$ can be rewritten as:

$$y^{(1)}-\sum_{i=3}^m \alpha_i y^{(i)} K_{1 i}=b^{\text{old}}-E_1+\alpha_1^{\text{old}} y^{(1)} K_{11}+\alpha_2^{\text{old}} y^{(2)} K_{12}$$
$$b_1^{\text{new}}=b^{\text{old}}-E_1-y^{(1)} K_{11}\left(\alpha_1^{\text{new}}-\alpha_1^{\text{old}}\right)-y^{(2)} K_{12}\left(\alpha_2^{\text{new,clipped}}-\alpha_2^{\text{old}}\right)$$

Similarly, when $0 < \alpha_2^{\text{new,clipped}} < C$,

$$b_2^{\text{new}}=b^{\text{old}}-E_2-y^{(1)} K_{12}\left(\alpha_1^{\text{new}}-\alpha_1^{\text{old}}\right)-y^{(2)} K_{22}\left(\alpha_2^{\text{new,clipped}}-\alpha_2^{\text{old}}\right)$$

Therefore,

$$b^{\text{new}}= \begin{cases}b_1^{\text{new}}, & 0<\alpha_1^{\text{new}}<C \\ b_2^{\text{new}}, & 0 < \alpha_2^{\text{new,clipped}} < C \\ \frac{b_1^{\text{new}}+b_2^{\text{new}}}{2}, & \text{otherwise} \end{cases}$$

Code

```python
import numpy as np


def compute_w(data_x, data_y, alphas):
    # w = sum_i alpha_i * y_i * x_i
    p1 = data_y.reshape(-1, 1) * data_x
    p2 = alphas.reshape(-1, 1) * p1
    return np.sum(p2, axis=0)


def kernel(x1, x2):
    # Linear kernel K(x1, x2) = <x1, x2>
    return np.dot(x1, x2)


def f_x(data_x, data_y, alphas, x, b):
    # Decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b
    k = kernel(data_x, x)
    r = alphas * data_y * k
    return np.sum(r) + b


def compute_L_H(C, alpha_i, alpha_j, y_i, y_j):
    # Endpoints of the feasible segment for alpha_j inside the box [0, C] x [0, C]
    L = np.max((0., alpha_j - alpha_i))
    H = np.min((C, C + alpha_j - alpha_i))
    if y_i == y_j:
        L = np.max((0., alpha_i + alpha_j - C))
        H = np.min((C, alpha_i + alpha_j))
    return L, H


def compute_eta(x_i, x_j):
    # eta = 2 K(x_i, x_j) - K(x_i, x_i) - K(x_j, x_j), negative for distinct points
    return 2 * kernel(x_i, x_j) - kernel(x_i, x_i) - kernel(x_j, x_j)


def compute_E_k(f_x_k, y_k):
    # Prediction error E_k = f(x_k) - y_k
    return f_x_k - y_k


def clip_alpha_j(alpha_j, H, L):
    # Clip alpha_j to the segment [L, H]
    if alpha_j > H:
        return H
    if alpha_j < L:
        return L
    return alpha_j


def compute_alpha_j(alpha_j, E_i, E_j, y_j, eta):
    # Unconstrained optimum along the constraint line
    return alpha_j - (y_j * (E_i - E_j) / eta)


def compute_alpha_i(alpha_i, y_i, y_j, alpha_j, alpha_old_j):
    # alpha_i_new = alpha_i_old + y_i * y_j * (alpha_j_old - alpha_j_new)
    return alpha_i + y_i * y_j * (alpha_old_j - alpha_j)


def compute_b1(b, E_i, y_i, alpha_i, alpha_old_i, x_i, y_j, alpha_j, alpha_j_old, x_j):
    p1 = b - E_i - y_i * (alpha_i - alpha_old_i) * kernel(x_i, x_i)
    p2 = y_j * (alpha_j - alpha_j_old) * kernel(x_i, x_j)
    return p1 - p2


def compute_b2(b, E_j, y_i, alpha_i, alpha_old_i, x_i, x_j, y_j, alpha_j, alpha_j_old):
    p1 = b - E_j - y_i * (alpha_i - alpha_old_i) * kernel(x_i, x_j)
    p2 = y_j * (alpha_j - alpha_j_old) * kernel(x_j, x_j)
    return p1 - p2


def clip_b(alpha_i, alpha_j, b1, b2, C):
    # Choose b according to which multiplier lies strictly inside (0, C)
    if 0 < alpha_i < C:
        return b1
    if 0 < alpha_j < C:
        return b2
    return (b1 + b2) / 2


def select_j(i, m):
    # Randomly select an index j != i
    j = np.random.randint(m)
    while i == j:
        j = np.random.randint(m)
    return j


def smo(C, tol, max_passes, data_x, data_y):
    m, n = data_x.shape
    b, passes = 0., 0
    alphas = np.zeros(shape=(m))
    alphas_old = np.zeros(shape=(m))
    while passes < max_passes:
        num_changed_alphas = 0
        for i in range(m):
            x_i, y_i, alpha_i = data_x[i], data_y[i], alphas[i]
            f_x_i = f_x(data_x, data_y, alphas, x_i, b)
            E_i = compute_E_k(f_x_i, y_i)
            # Pick an alpha_i that violates the KKT conditions (within tolerance tol)
            if (y_i * E_i < -tol and alpha_i < C) or (y_i * E_i > tol and alpha_i > 0.):
                j = select_j(i, m)
                x_j, y_j, alpha_j = data_x[j], data_y[j], alphas[j]
                f_x_j = f_x(data_x, data_y, alphas, x_j, b)
                E_j = compute_E_k(f_x_j, y_j)
                alphas_old[i], alphas_old[j] = alpha_i, alpha_j
                L, H = compute_L_H(C, alpha_i, alpha_j, y_i, y_j)
                if L == H:
                    continue
                eta = compute_eta(x_i, x_j)
                if eta >= 0:
                    continue
                alpha_j = compute_alpha_j(alpha_j, E_i, E_j, y_j, eta)
                alpha_j = clip_alpha_j(alpha_j, H, L)
                if np.abs(alpha_j - alphas_old[j]) < 1e-5:
                    continue
                alphas[j] = alpha_j
                alpha_i = compute_alpha_i(alpha_i, y_i, y_j, alpha_j, alphas_old[j])
                b1 = compute_b1(b, E_i, y_i, alpha_i, alphas_old[i], x_i, y_j, alpha_j, alphas_old[j], x_j)
                b2 = compute_b2(b, E_j, y_i, alpha_i, alphas_old[i], x_i, x_j, y_j, alpha_j, alphas_old[j])
                b = clip_b(alpha_i, alpha_j, b1, b2, C)
                num_changed_alphas += 1
                alphas[i] = alpha_i

        if num_changed_alphas == 0:
            passes += 1
        else:
            passes = 0
    return alphas, b
```
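A quick way to exercise the implementation above (a usage sketch; the toy dataset and hyperparameters are arbitrary):

```python
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two linearly separable Gaussian blobs with labels in {-1, +1}
    X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
    X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
    data_x = np.vstack([X_pos, X_neg])
    data_y = np.hstack([np.ones(20), -np.ones(20)])

    alphas, b = smo(C=1.0, tol=1e-3, max_passes=10, data_x=data_x, data_y=data_y)
    w = compute_w(data_x, data_y, alphas)

    preds = np.sign(data_x @ w + b)
    print("training accuracy:", np.mean(preds == data_y))
    print("support vectors:", np.sum(alphas > 1e-6))
```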

Reference

Andrew Ng, Machine Learning, Stanford University