Theory

Hyperplane:

$$w^Tx+b=0$$
  • $w$: weight coefficient vector
  • $b$: intercept
  • $x$: sample point
  • $y = +1$: positive sample; $y = -1$: negative sample

Functional Margin

$$\hat{\gamma}^{i} = y^i(w^Tx^i+b)$$

Geometric Margin

The actual distance from the sample point to the hyperplane

$x^i$: sample point A
$\gamma^i$: modulus (length) of the vector $\overrightarrow{BA}$
B: $x^i-\gamma^i \cdot \frac{w}{||w||}$
Since B lies on the hyperplane,
$$w^T\left(x^i-\gamma^i \cdot \frac{w}{||w||}\right)+b=0$$

Hence, when $y_A=1$, the geometric margin is

$$\gamma^i=\frac{w^Tx^i+b}{||w||}=\left(\frac{w}{||w||}\right)^{T}x^i+\frac{b}{||w||}$$
More generally,
$$\gamma^i=y^i\left(\left(\frac{w}{||w||}\right)^{T}x^i+\frac{b}{||w||}\right)$$
The functional margin and the geometric margin are related by
$$\gamma=\frac{\hat{\gamma}}{||w||}$$
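As a quick illustration, here is a minimal numpy sketch (with made-up values for $w$, $b$ and a few labelled sample points) that computes both margins directly from the definitions above:

```python
import numpy as np

# Hypothetical hyperplane parameters and labelled samples (illustrative values only)
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0],
              [0.0, 3.0],
              [-2.0, 1.0]])
y = np.array([1, -1, -1])

functional_margin = y * (X @ w + b)                        # hat{gamma}^i = y^i (w^T x^i + b)
geometric_margin = functional_margin / np.linalg.norm(w)   # gamma^i = hat{gamma}^i / ||w||

print(functional_margin)  # positive values => correctly classified points
print(geometric_margin)   # signed distances to the hyperplane
```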

Kernels

Instead of applying the SVM directly to the original input attributes $x$, we may want to train on some features $\phi(x)$.
This can be achieved by revising our prior algorithm and substituting every occurrence of $x$ with $\phi(x)$.
Since the algorithm is expressed solely in terms of inner products $\langle x,z\rangle$,
we can replace those inner products with $\langle\phi(x),\phi(z)\rangle$.
In particular, given a feature mapping $\phi$, we define the corresponding kernel as

$$K(x, z) = \phi(x)^{T}\phi(z)$$
Everywhere we previously had $\langle x, z\rangle$ in our algorithm,
we can simply substitute $K(x,z)$, and the algorithm is then trained using the features $\phi$.

Now, let’s talk about a slightly different view of kernels.
If $\phi(x)$ and $\phi(z)$ are close together, we might expect $K(x,z) = \phi(x)^{T}\phi(z)$ to be large.
Conversely, if $\phi(x)$ and $\phi(z)$ are far apart (say, nearly orthogonal to each other), then $K(x, z) = \phi(x)^{T}\phi(z)$ will be small.
So we can think of $K(x, z)$ as a measure of how similar $\phi(x)$ and $\phi(z)$ are, or of how similar $x$ and $z$ are.

$$K(x,z)=\exp\left(\frac{-||x-z||^2}{2\sigma^2}\right)=\exp\left(\frac{-||x||^2}{2\sigma^2}\right)\exp\left(\frac{-||z||^2}{2\sigma^2}\right)\exp\left(\frac{\langle x, z\rangle}{\sigma^2}\right)$$
where the Taylor expansion of the third factor is
$$\exp\left(\frac{\langle x, z \rangle}{\sigma^2}\right)=1+\frac{\langle x,z\rangle}{\sigma^2}+\frac{\langle x,z\rangle^2}{2\sigma^4}+\frac{\langle x,z\rangle^3}{3!\sigma^6}+\cdots=\sum\limits_{n=0}^{\infty}\frac{\langle x,z\rangle^{n}}{n!\,\sigma^{2n}}$$

Because this Taylor expansion contains inner-product terms of every order, the Gaussian kernel, also called the Radial Basis Function (RBF) kernel, implicitly maps the input from a low-dimensional space into an infinite-dimensional feature space.
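The factorization above is easy to verify numerically. Below is a small sketch (using randomly drawn vectors and an arbitrary $\sigma$, both chosen only for illustration) that checks the two sides of the identity agree:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

lhs = np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))
rhs = (np.exp(-np.linalg.norm(x) ** 2 / (2 * sigma ** 2))
       * np.exp(-np.linalg.norm(z) ** 2 / (2 * sigma ** 2))
       * np.exp(np.dot(x, z) / sigma ** 2))

# ||x - z||^2 = ||x||^2 + ||z||^2 - 2<x, z>, so the two expressions coincide
print(np.isclose(lhs, rhs))  # True
```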

Common Kernel Functions

Linear Kernel

$$K(x,z)=\langle x,z \rangle$$

Polynomial Kernel

$$K(x,z)=(\gamma\langle x,z \rangle + r)^d$$

Gaussian Kernel

$$K(x,z)=\exp(-\gamma||x-z||^2)$$

Sigmoid Kernel

$$K(x,z)=\tanh(\gamma\langle x,z \rangle + r)$$
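For reference, the four kernels above translate directly into numpy one-liners. This is only a sketch; the parameter names `gamma`, `r` and `d` are chosen here to mirror the formulas and their default values are arbitrary:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * np.dot(x, z) + r) ** d

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def sigmoid_kernel(x, z, gamma=0.01, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)
```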

Lagrange Multipliers

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.
The basic idea is to convert a constrained problem into a form to which the derivative test of an unconstrained problem can still be applied.
Thus, for the conditional extrema of a multivariate function $f(x,y,z,\cdots)$ under multiple constraints $\phi(x,y,\cdots) = 0,\ \psi(x,y,\cdots) = 0,\ \cdots$,
the steps of the Lagrange multiplier method can be summarized as follows (a small worked example follows the steps):

  1. Form the Lagrangian function

    $$F(x,y,z,\cdots,\lambda,\mu,\cdots)=f(x,y,z,\cdots)+ \lambda\phi(x,y,\cdots)+\mu\psi(x,y,\cdots)+\cdots$$

  2. Find the stationary points of the multivariate function $F(x,y,z,\cdots,\lambda,\mu,\cdots)$
    Solve the system of equations

$$\left\{ \begin{array}{l} F_x=0 \\ F_y=0 \\ \cdots \\ F_{\lambda}=0\\ \cdots \end{array} \right.$$

    to find the stationary point $(x_0,y_0,z_0,\cdots,\lambda_0,\mu_0,\cdots)$;
    then $f(x_0,y_0,z_0,\cdots)$ is a possible conditional extreme value.
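As a small worked example of these steps (a sketch using sympy; the objective $f(x,y)=x^2+y^2$ and the constraint $x+y-1=0$ are chosen only for illustration):

```python
from sympy import symbols, diff, solve

x, y, lam = symbols('x y lambda')

# Step 1: form the Lagrangian F = f + lambda * phi
f = x**2 + y**2          # objective
phi = x + y - 1          # equality constraint phi(x, y) = 0
F = f + lam * phi

# Step 2: solve F_x = F_y = F_lambda = 0 for the stationary point
stationary = solve([diff(F, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)                # [{x: 1/2, y: 1/2, lambda: -1}]
print(f.subs(stationary[0]))     # 1/2, the conditional extremum
```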

Generalized Lagrangian

$$L(w,\alpha,\beta)=f(w)+\sum\limits_{i=1}^k\alpha_ig_i(w)+\sum\limits_{i=1}^l\beta_ih_i(w)$$
Here the $\alpha_i$'s and $\beta_i$'s are the Lagrange multipliers. Consider the quantity
$$\theta_{P}(w)= \max \limits_{\alpha,\beta : \alpha_i \geq 0} L(w,\alpha,\beta)$$

Here the subscript P stands for “primal”.
The primal constraints are $g_i(w) \leq 0$ and $h_i(w)=0$.
If the constraints are satisfied for a particular value of $w$, then

$$\max\limits_{\alpha,\beta:\alpha_i \geq 0}\left(\sum\limits_{i=1}^{k}\alpha_ig_i(w)+\sum\limits_{i=1}^{l}\beta_{i} h_i(w)\right)=0$$

which means

$$\theta_P(w)=f(w)+0=f(w)$$

If the constraints are not satisfied,

$$\theta_P(w)=\infty$$

Dual Optimization Problem

Define

$$\theta_D(\alpha,\beta)=\min\limits_{w}L(w,\alpha,\beta)$$
Here D stands for “dual”. We can now pose the dual optimization problem:
$$d^{*}=\max\limits_{\alpha,\beta:\alpha_i \geq 0}\theta_{D}(\alpha,\beta)=\max\limits_{\alpha,\beta:\alpha_i \geq 0}\min\limits_{w}L(w,\alpha,\beta)$$

The primal and the dual problems are related as follows:

$$d^{*}=\max\limits_{\alpha,\beta: \alpha_i \geq 0}\min\limits_{w}L(w,\alpha,\beta) \leq \min\limits_{w} \max\limits_{\alpha,\beta:\alpha_i \geq 0}L(w,\alpha,\beta)=p^{*}$$
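This weak-duality inequality $d^* \leq p^*$ can be seen numerically on a toy Lagrangian. The sketch below uses the made-up problem $\min_w w^2$ subject to $1 - w \leq 0$, whose Lagrangian is $L(w,\alpha) = w^2 + \alpha(1 - w)$, and evaluates both sides of the inequality on a grid:

```python
import numpy as np

# Toy problem: min w^2  s.t.  1 - w <= 0, with Lagrangian L(w, a) = w^2 + a*(1 - w)
w = np.linspace(-3.0, 3.0, 601)
a = np.linspace(0.0, 4.0, 401)           # multipliers constrained to a >= 0
W, A = np.meshgrid(w, a)                  # rows index a, columns index w
L = W**2 + A * (1.0 - W)

d_star = L.min(axis=1).max()              # max over a of min over w  (dual)
p_star = L.max(axis=0).min()              # min over w of max over a  (primal, on the bounded grid)

print(d_star, p_star, d_star <= p_star + 1e-9)
# Both values are approximately 1 here; for this convex problem strong duality holds.
```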

Karush-Kuhn-Tucker (KKT) Conditions

Suppose $f(w)$ and the $g_i(w)$'s are convex, and the $h_i$'s are affine.
Suppose further that the constraints $g_i$ are strictly feasible;
this means that there exists some $w$ so that $g_i(w) < 0$ for all $i$.
Under these assumptions, there must exist $w^*, \alpha^*, \beta^*$ so that $w^*$ is the solution to the primal problem
and $\alpha^*,\beta^*$ are the solution to the dual problem.
Furthermore, $p^* = d^* = L(w^*, \alpha^*, \beta^*)$.

Moreover, $w^*, \alpha^*$ and $\beta^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:
$$\begin{aligned} \frac{\partial}{\partial w_i} \mathcal{L}\left(w^*, \alpha^*, \beta^*\right) & =0, \quad i=1, \ldots, d \\ \frac{\partial}{\partial \beta_i} \mathcal{L}\left(w^*, \alpha^*, \beta^*\right) & =0, \quad i=1, \ldots, l \\ \alpha_i^* g_i\left(w^*\right) & =0, \quad i=1, \ldots, k \\ g_i\left(w^*\right) & \leq 0, \quad i=1, \ldots, k \\ \alpha_i^* & \geq 0, \quad i=1, \ldots, k\end{aligned}$$
The condition $\alpha_i^* g_i(w^*) = 0$ is called the KKT dual complementarity condition.

Specifically, it implies that if $\alpha_i^{*} > 0$, then $g_i(w^*) = 0$.
This will be key for showing that the SVM has only a small number of “support vectors”.

Optimization

Hard Margin

Find the minimum of $L$ with respect to $w$ and $b$ to obtain $W(\alpha)$

The optimization objective of the hard-margin SVM is

$$\min\limits_{w,b}\frac{1}{2}||w||^2$$
$$\text{s.t. } y^{(i)}(w^Tx^{(i)}+b) \geq 1, \quad i=1,\cdots,m$$
The generalized Lagrangian function is:
$$L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum\limits_{i=1}^m\alpha_i[y^{(i)}(w^{T}x^{(i)}+b)-1]$$

where $\alpha_i \geq 0$ are the Lagrange multipliers and

$$g_i(w)=-y^{(i)}(w^{T}x^{(i)}+b)+1 \leq 0$$

Then the original problem’s dual optimization problem is:

$$d^* = \max\limits_{\alpha:\alpha_i \geq 0}\min\limits_{w,b}L(w,b,\alpha)$$

To solve this, we take the partial derivatives of $L$ with respect to $w$ and $b$ and set them to zero:

$$\frac{\partial L}{\partial w}=w-\sum\limits_{i=1}^m \alpha_{i}y^{(i)}x^{(i)}=0$$
$$\frac{\partial L}{\partial b}=-\sum\limits_{i=1}^m\alpha_{i}y^{(i)}=0$$

Then we have

$$w=\sum\limits_{i=1}^m\alpha_i y^{(i)}x^{(i)}$$
$$W(\alpha)=\min\limits_{w,b}L(w,b,\alpha)=\sum\limits_{i=1}^{m}\alpha_i-\frac{1}{2}\sum\limits_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^{T}x^{(j)}$$

Solution Steps:

$$\begin{aligned} L(w, b, \alpha)&=\frac{1}{2} w^{T} w-\sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^{T} x^{(i)}+b\right)-1\right] \\ & =\frac{1}{2} w^{T} w-\sum_{i=1}^m \alpha_i y^{(i)} w^{T} x^{(i)}-\sum_{i=1}^m \alpha_i y^{(i)} b+\sum_{i=1}^{m} \alpha_i \\ & =\frac{1}{2} w^{T} w-w^{T} \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}-b \sum_{i=1}^m \alpha_i y^{(i)}+\sum_{i=1}^m \alpha_i \end{aligned}$$
$$\begin{aligned} W(\alpha) & =\frac{1}{2} w^{T} w-w^{T} w+\sum_{i=1}^m \alpha_i \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} \sum_{i, j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)}\left(x^{(i)}\right)^T x^{(j)} \end{aligned}$$

Hence

$$W(\alpha)=\min\limits_{w, b}\mathcal{L}(w, b, \alpha)=\sum\limits_{i=1}^m \alpha_i-\frac{1}{2} \sum\limits_{i, j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^T x^{(j)}$$

Find the maximum of $W(\alpha)$ with respect to $\alpha$

Consider the optimization problem:

$$\begin{aligned} & \max_{\alpha} W(\alpha)=\max_{\alpha} \sum_{i=1}^m \alpha_i-\frac{1}{2} \sum\limits_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^{T} x^{(j)} \\ & \text { s.t. } \alpha_i \geqslant 0, \quad i=1,2, \cdots, m\end{aligned}$$

together with the constraint:

$$\sum\limits_{i=1}^{m} \alpha_{i} y^{(i)}=0$$
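Concretely, this dual objective can be written in matrix form as $W(\alpha) = \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T Q \alpha$ with $Q_{ij} = y^{(i)}y^{(j)}(x^{(i)})^T x^{(j)}$. A small numpy sketch (the data and the feasible $\alpha$ below are toy values, for illustration only) that evaluates this objective and checks the equality constraint:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha, with Q_ij = y_i y_j <x_i, x_j>."""
    Q = (y[:, None] * X) @ (y[:, None] * X).T
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# Toy linearly separable data
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = np.array([0.2, 0.0, 0.1, 0.1])   # feasible: alpha_i >= 0 and sum(alpha_i * y_i) == 0
print(dual_objective(alpha, X, y))
print(np.isclose(alpha @ y, 0.0))        # equality constraint sum alpha_i y_i = 0
```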

Furthermore, suppose $\alpha^*=(\alpha_1^*,\alpha_2^*,\cdots, \alpha_m^*)^T$ is the solution of the dual optimization problem.
For the solutions of the dual optimization problem and the original optimization problem to coincide,
the following KKT conditions need to be satisfied:

$$\begin{aligned} & \frac{\partial}{\partial w} L\left(w^*, b^*, \alpha^*\right)=w^*-\sum_{i=1}^m \alpha_i^* y^{(i)} x^{(i)}=0 \\ & \frac{\partial}{\partial b} L\left(w^*, b^*, \alpha^*\right)=-\sum_{i=1}^m \alpha_i^* y^{(i)}=0 \\ & \alpha_i^* g_i\left(w^*\right)=\alpha_i^*\left[y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)-1\right]=0, \quad i=1,2, \cdots, m \\ & g_i\left(w^*\right)=-y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)+1 \leqslant 0, \quad i=1,2, \cdots, m \\ & \alpha_i^* \geqslant 0, \quad i=1,2, \cdots, m \end{aligned}$$
Therefore,
$$w^*=\sum\limits_{i=1}^m\alpha_i^*y^{(i)}x^{(i)}$$
Since $w^* \neq 0$, there exists at least one $\alpha_j^* > 0$.

If $\alpha_j^* > 0$, then by the complementarity condition we must have:

$$y^{(j)}(w^* \cdot x^{(j)}+b^*)=1$$
This shows that the sample point $x^{(j)}$ is a support vector, and that the hyperplane depends only on the support vectors.

Moreover, since $(y^{(j)})^2=1$, we obtain

$$b^* = y^{(j)}-\sum\limits_{i=1}^m\alpha_{i}^*y^{(i)}(x^{(i)} \cdot x^{(j)})$$
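In code, once the dual solution $\alpha^*$ is available (from any solver), $w^*$ and $b^*$ follow directly from the two formulas above. A hedged sketch, where `alphas`, `X` and `y` are assumed to come from an already solved hard-margin dual:

```python
import numpy as np

def recover_w_b(alphas, X, y, eps=1e-8):
    """Recover the primal solution from the dual: w* = sum_i alpha_i y_i x_i and
    b* = y_j - sum_i alpha_i y_i <x_i, x_j> for any support vector j (alpha_j > 0)."""
    w = (alphas * y) @ X                   # w* = sum_i alpha_i y_i x_i
    support = np.where(alphas > eps)[0]    # support vectors are the points with alpha_i > 0
    j = support[0]                         # any support vector works for b*
    b = y[j] - (alphas * y) @ (X @ X[j])   # b* = y_j - sum_i alpha_i y_i (x_i . x_j)
    return w, b, support
```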

Soft Margin

The optimization objective of the SVM soft margin is

$$\min\limits_{w,b}\frac{1}{2}||w||^2 + C\sum\limits_{i=1}^{m} \zeta_i$$
$$\text{s.t. } y^{(i)}(w^Tx^{(i)}+b)\geq 1- \zeta_i,\quad \zeta_i \geq 0, \quad i=1,2,\cdots,m$$

Then the generalized Lagrangian function is:

$$L(w, b, \zeta, \alpha, r)=\frac{1}{2}\|w\|^2+C \sum_{i=1}^m \zeta_i-\sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^{T} x^{(i)}+b\right)-1+\zeta_i\right]-\sum_{i=1}^m r_i \zeta_i$$

where $\alpha_i \geq 0$ and $r_i \geq 0$. Also note that

$$\left\{\begin{array}{l}g_i(w)=-y^{(i)}\left(w^{T} x^{(i)}+b\right)+1-\zeta_i \leqslant 0 \\ h_i(\zeta)=-\zeta_i \leqslant 0 ; \quad i=1,2, \cdots, m\end{array}\right.$$

The dual of this primal optimization problem is then:

$$d^*=\max\limits_{\alpha,r}\min\limits_{w,b,\zeta}L(w,b,\zeta,\alpha,r)$$

Find the minimum of $L$ with respect to $w$, $b$ and $\zeta$ to obtain $W(\alpha, r)$

$$\frac{\partial L}{\partial w}=w-\sum\limits_{i=1}^{m}\alpha_{i}y^{(i)}x^{(i)}=0$$
$$\begin{gathered}\frac{\partial L}{\partial b}=-\sum_{i=1}^m \alpha_i y^{(i)}=0 \\ \frac{\partial L}{\partial \zeta_i}=C-\alpha_i-r_i=0 \\ w=\sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} \\ \sum_{i=1}^m \alpha_i y^{(i)}=0 \\ C-\alpha_i-r_i=0\end{gathered}$$

Further,

$$\begin{aligned} W(\alpha, r) & =\frac{1}{2} w^{T} w+C \sum_{i=1}^m \zeta_i-w^{T} w+\sum_{i=1}^m \alpha_i-\sum_{i=1}^m \alpha_i \zeta_i-\sum_{i=1}^m r_i \zeta_i \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w+\sum_{i=1}^m \zeta_i\left(C-\alpha_i-r_i\right) \\ & =\sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w \end{aligned}$$

Find the maximum of $W(\alpha)$ with respect to $\alpha$

Maximizing over $\alpha$, we have:

$$\begin{aligned} & \max _\alpha W(\alpha)=\max _\alpha \sum_{i=1}^m \alpha_i-\frac{1}{2} w^{T} w \\ & \text { s.t. } \sum_{i=1}^m \alpha_i y^{(i)}=0 \\ & C-\alpha_i-r_i=0, \quad i=1,2, \cdots, m \\ & \alpha_i \geqslant 0, \quad i=1,2, \cdots, m \\ & r_i \geqslant 0, \quad i=1,2, \cdots, m \end{aligned}$$

Combining these constraints gives:

$$0 \leq \alpha_i \leq C$$

Finally, the dual optimization problem is:

$$\begin{aligned} & \max _\alpha \sum_{i=1}^m \alpha_i-\frac{1}{2} \sum_{i, j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^{T} x^{(j)} \\ & \text { s.t. } 0 \leqslant \alpha_i \leqslant C, \quad i=1,2, \cdots, m \quad \text { and } \quad \sum_{i=1}^m \alpha_i y^{(i)}=0 \end{aligned}$$
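This box-constrained dual can be handed to a generic constrained optimizer. Below is a minimal sketch using scipy's SLSQP solver (the toy data and $C$ are arbitrary; in practice one would use SMO, derived in the next section, which is far more efficient):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (two points per class) and regularization parameter
X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
m = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):                              # minimize the negative dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual,
               x0=np.zeros(m),
               method='SLSQP',
               bounds=[(0.0, C)] * m,                               # 0 <= alpha_i <= C
               constraints={'type': 'eq', 'fun': lambda a: a @ y})  # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X
print(alpha, w)
```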

Now suppose $\alpha^*=(\alpha_1^*,\alpha_2^*,\cdots,\alpha_m^*)^T$ is the solution of the dual optimization problem.
For the solutions of the dual optimization problem and the original optimization problem to coincide,
the following KKT conditions need to be satisfied:

$$\begin{aligned} & \frac{\partial}{\partial w} L\left(w^*, b^*, \zeta^*, \alpha^*, r^*\right)=w^*-\sum\limits_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}=0 \\ & \frac{\partial}{\partial b} L\left(w^*, b^*, \zeta^*, \alpha^*, r^*\right)=-\sum\limits_{i=1}^{m} \alpha_i^* y^{(i)}=0 \\ & \frac{\partial}{\partial \zeta} L\left(w^*, b^*, \zeta^*, \alpha^*, r^*\right)=C-\alpha^*-r^*=0 \\ & \alpha_i^* g_i\left(w^*\right)=\alpha_i^*\left[y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)-1+\zeta_i^*\right]=0, \quad i=1,2, \cdots, m \\ & r_i^* h_i\left(\zeta^*\right)=r_i^* \zeta_i^*=0, \quad i=1,2, \cdots, m \\ & g_i\left(w^*\right)=-y^{(i)}\left(w^* \cdot x^{(i)}+b^*\right)+1-\zeta_i^* \leqslant 0, \quad i=1,2, \cdots, m \\ & h_i\left(\zeta^*\right)=-\zeta_i^* \leqslant 0, \quad i=1,2, \cdots, m \\ & \alpha_i^* \geqslant 0, \quad r_i^* \geqslant 0, \quad i=1,2, \cdots, m \end{aligned}$$

From the first equation we know that

$$w^*=\sum\limits_{i=1}^m\alpha_i^*y^{(i)}x^{(i)}$$

If there exists a component $\alpha_j^*$ with $0 < \alpha_j^* < C$,
then $C-\alpha_j^*-r_j^*=0$ gives $r_j^* > 0$,
and $r_j^* \zeta_j^* = 0$ then forces $\zeta_j^* = 0$;
so finally, at this point we know that there must be:

$$y^{(j)}(w^* \cdot x^{(j)}+b^*)=1$$

Further,

$$b^*=y^{(j)}-\sum\limits_{i=1}^{m}\alpha_i^*y^{(i)}(x^{(i)} \cdot x^{(j)})$$

SMO

Coordinate Ascent

Starting from an initial point, coordinate ascent moves along only one coordinate direction at a time, optimizing the objective function with respect to that single variable while all the others are held fixed.

The SMO algorithm simply does the following:
Repeat till convergence {

  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Reoptimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \neq i, j$) fixed.
    }
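To see what coordinate ascent does before adding the SVM-specific constraint handling, here is a toy sketch that maximizes a simple concave quadratic $W(\alpha) = p^T\alpha - \frac{1}{2}\alpha^T Q\alpha$ by repeatedly re-optimizing one coordinate at a time (the matrix and vector are made up for illustration):

```python
import numpy as np

# Concave toy objective W(a) = p^T a - 0.5 a^T Q a, maximized where Q a = p, i.e. a = [1, 1]
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])
p = np.array([3.0, 3.0])

a = np.zeros(2)
for sweep in range(20):
    for i in range(2):
        # Holding the other coordinate fixed, the 1-D maximizer is found by
        # setting dW/da_i = p_i - sum_j Q_ij a_j = 0 and solving for a_i.
        a[i] = (p[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a)   # converges to approximately [1.0, 1.0]
```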

Theory

The optimization problem to be solved ($K(\cdot,\cdot)$ is any kernel function) is:

$$\max\limits_{\alpha}\sum\limits_{i=1}^m\alpha_i-\frac{1}{2}\sum\limits_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_{j}K(x^{(i)}, x^{(j)})$$
$$\text{s.t. } 0 \leq \alpha_i \leq C,\quad i=1,2,\cdots,m \quad \text{ and } \quad \sum\limits_{i=1}^{m}\alpha_{i}y^{(i)}=0$$
Fixing all multipliers except $\alpha_1$ and $\alpha_2$, the two-variable subproblem becomes
$$\begin{aligned} & \max _{\alpha_1, \alpha_2} \alpha_1+\alpha_2-\frac{1}{2} \alpha_1^2 K_{11}-\frac{1}{2} \alpha_2^2 K_{22}-\alpha_1 \alpha_2 y^{(1)} y^{(2)} K_{12}-\alpha_1 y^{(1)} \sum_{i=3}^m \alpha_i y^{(i)} K_{1i}-\alpha_2 y^{(2)} \sum_{i=3}^m \alpha_i y^{(i)} K_{2 i}+\Psi_{\text {constant }} \\ & \text { s.t. } 0 \leqslant \alpha_i \leqslant C, \quad i=1,2 \\ & \alpha_1 y^{(1)}+\alpha_2 y^{(2)}=-\sum_{i=3}^m \alpha_i y^{(i)}=\zeta \end{aligned}$$

where $K_{ij}=K(x^{(i)}, x^{(j)})$ and $\Psi_{\text{constant}}$ is a constant that does not depend on $\alpha_1$ and $\alpha_2$.

Note that

$$\begin{aligned} & g(x)=\sum_{i=1}^m \alpha_i y^{(i)} K\left(x, x^{(i)}\right)+b \\ & v_i=\sum_{j=3}^m \alpha_j y^{(j)} K\left(x^{(i)}, x^{(j)}\right)=g\left(x^{(i)}\right)-\sum_{j=1}^2 \alpha_j y^{(j)} K\left(x^{(i)}, x^{(j)}\right)-b, \quad i=1,2 \end{aligned}$$

Then the objective function can be rewritten as:

$$W\left(\alpha_1, \alpha_2\right)= \alpha_1+\alpha_2-\frac{1}{2} \alpha_1^2 K_{11}-\frac{1}{2} \alpha_2^2 K_{22}-\alpha_1 \alpha_2 y^{(1)} y^{(2)} K_{12}-\alpha_1 y^{(1)} v_1 -\alpha_2 y^{(2)} v_2+\Psi_{\text {constant}}$$
Substituting $\alpha_1=\left(\zeta-\alpha_2 y^{(2)}\right) y^{(1)}$ into the above equation, we get an objective function of the single variable $\alpha_2$:
$$\begin{aligned} W\left(\alpha_2\right)= & \left(\zeta-\alpha_2 y^{(2)}\right) y^{(1)}+\alpha_2-\frac{1}{2}\left(\zeta-\alpha_2 y^{(2)}\right)^2 K_{11}- \\ & \left(\zeta-\alpha_2 y^{(2)}\right) \alpha_2 y^{(2)} K_{12}-\left(\zeta-\alpha_2 y^{(2)}\right) v_1-\frac{1}{2} \alpha_2^2 K_{22}-\alpha_2 y^{(2)} v_2 \end{aligned}$$

The derivative of the above expression with respect to $\alpha_2$ is

$$\begin{aligned} \frac{\partial W}{\partial \alpha_2}= & -y^{(1)} y^{(2)}+1+\zeta y^{(2)} K_{11}-\alpha_2 K_{11}+2 \alpha_2 K_{12}- \\ & \zeta y^{(2)} K_{12}+v_1 y^{(2)}-\alpha_2 K_{22}-y^{(2)} v_2 \end{aligned}$$

Setting this derivative to zero, we get

$$\alpha_2=\frac{y^{(2)}\left(y^{(2)}-y^{(1)}+\zeta K_{11}-\zeta K_{12}+v_1-v_2\right)}{K_{11}-2 K_{12}+K_{22}}$$

Note that

$$\begin{aligned} & \eta=K_{11}-2 K_{12}+K_{22} \\ & E_i=g\left(x^{(i)}\right)-y^{(i)}=\left(\sum_{j=1}^m \alpha_j y^{(j)} K\left(x^{(i)}, x^{(j)}\right)+b\right)-y^{(i)}, \quad i=1,2 \end{aligned}$$

After initializing $\alpha_1^{old}$ and $\alpha_2^{old}$,
substituting the definitions above and $\zeta=\alpha_1^{old}y^{(1)}+\alpha_2^{old}y^{(2)}$ into the expression for $\alpha_2$ gives

$$\begin{aligned} \alpha_2^{new}=\frac{y^{(2)}}{\eta}&\left[y^{(2)}-y^{(1)}+\left(\alpha_1^{old} y^{(1)}+\alpha_2^{old} y^{(2)}\right) K_{11}-\left(\alpha_1^{old} y^{(1)}+\alpha_2^{old} y^{(2)}\right) K_{12}\right. \\ &\left.+g\left(x^{(1)}\right)-\sum_{j=1}^2 \alpha_j^{old} y^{(j)} K_{1j}-b-g\left(x^{(2)}\right)+\sum_{j=1}^2 \alpha_j^{old} y^{(j)} K_{2 j}+b\right] \end{aligned}$$

which simplifies to

$$\alpha_2^{new}=\alpha_2^{old}+\frac{y^{(2)}\left(E_1-E_2\right)}{\eta}$$

Because of the constraints $0 \leq \alpha_1, \alpha_2 \leq C$ and $\alpha_1 y^{(1)}+\alpha_2 y^{(2)}=\zeta$, the pair $(\alpha_1, \alpha_2)$ can only lie on a line segment inside the box $[0,C]\times[0,C]$.
So the value of $\alpha_2^{new}$ must lie on the line containing that segment,
and it is also necessary to make $\alpha_2^{new}$ satisfy:

$$L \leq \alpha_2^{new} \leq H$$
where $L$ and $H$ are the two endpoints of the segment. If $y^{(1)} \neq y^{(2)}$, then
$$L=\max(0,\alpha_2^{old}-\alpha_1^{old}),\quad H=\min(C,C+\alpha_2^{old}-\alpha_{1}^{old})$$

If $y^{(1)} = y^{(2)}$, then

$$L=\max(0,\alpha_2^{old}+\alpha_1^{old}-C),\quad H=\min(C,\alpha_2^{old}+\alpha_{1}^{old})$$

Therefore,

$$\alpha_2^{\text{new,clipped}}= \begin{cases}H, & \alpha_2^{\text{new}}>H \\ \alpha_2^{\text{new}}, & L \leqslant \alpha_2^{\text{new}} \leqslant H \\ L, & \alpha_2^{\text{new}} < L\end{cases}$$

Further, with $\alpha_1=(\zeta-\alpha_{2}y^{(2)})y^{(1)}$ and $\zeta=\alpha_1^{old}y^{(1)}+\alpha_2^{old}y^{(2)}$, we get:

$$\begin{aligned} \alpha_1^{new}&=(\alpha_1^{old}y^{(1)}+\alpha_2^{old}y^{(2)}-\alpha_2^{new,clipped}y^{(2)})y^{(1)}\\ &=\alpha_1^{old}+y^{(1)}y^{(2)}(\alpha_2^{old}-\alpha_2^{new,clipped}) \end{aligned}$$

Solving for the bias $b$

When $0 < \alpha_1^{new} < C$, the first sample is a support vector, so

$$y^{(1)}\left(w^{T} x^{(1)}+b\right)=y^{(1)} g\left(x^{(1)}\right)=y^{(1)}\left(\sum_{i=1}^m \alpha_i y^{(i)} K_{1 i}+b\right)=1$$
$$\sum_{i=1}^m \alpha_i y^{(i)} K_{1 i}+b=y^{(1)}$$
$$b_1^{\text{new}}=y^{(1)}-\sum_{i=3}^m \alpha_i y^{(i)} K_{1 i}-\alpha_1^{\text{new}} y^{(1)} K_{11}-\alpha_2^{\text{new,clipped}} y^{(2)} K_{12}$$
$$E_1=\sum_{i=3}^m \alpha_i y^{(i)} K_{1 i}+\alpha_1^{\text{old}} y^{(1)} K_{11}+\alpha_2^{\text{old}} y^{(2)} K_{12}+b^{\text{old}}-y^{(1)}$$

The first two terms in the expression for $b_1^{\text{new}}$ can be rewritten as:

$$y^{(1)}-\sum_{i=3}^m \alpha_i y^{(i)} K_{1 i}=b^{\text{old}}-E_1+\alpha_1^{\text{old}} y^{(1)} K_{11}+\alpha_2^{\text{old}} y^{(2)} K_{12}$$
$$b_1^{\text{new}}=b^{\text{old}}-E_1-y^{(1)} K_{11}\left(\alpha_1^{\text{new}}-\alpha_1^{\text{old}}\right)-y^{(2)} K_{12}\left(\alpha_2^{\text{new,clipped}}-\alpha_2^{\text{old}}\right)$$

Similarly, when $0 < \alpha_2^{\text{new,clipped}} < C$,

$$b_2^{\text{new}}=b^{\text{old}}-E_2-y^{(1)} K_{12}\left(\alpha_1^{\text{new}}-\alpha_1^{\text{old}}\right)-y^{(2)} K_{22}\left(\alpha_2^{\text{new,clipped}}-\alpha_2^{\text{old}}\right)$$

Therefore,

$$b^{\text{new}}= \begin{cases}b_1^{\text{new}}, & 0<\alpha_1^{\text{new}}<C \\ b_2^{\text{new}}, & 0 < \alpha_2^{\text{new,clipped}} < C \\ \frac{b_1^{\text{new}}+b_2^{\text{new}}}{2}, & \text{otherwise} \end{cases}$$

Code

```python
import numpy as np


def compute_w(data_x, data_y, alphas):
    # w = sum_i alpha_i * y_i * x_i
    p1 = data_y.reshape(-1, 1) * data_x
    p2 = alphas.reshape(-1, 1) * p1
    return np.sum(p2, axis=0)


def kernel(x1, x2):
    # Linear kernel K(x1, x2) = <x1, x2>
    return np.dot(x1, x2)


def f_x(data_x, data_y, alphas, x, b):
    # Decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b
    k = kernel(data_x, x)
    r = alphas * data_y * k
    return np.sum(r) + b


def compute_L_H(C, alpha_i, alpha_j, y_i, y_j):
    # Endpoints of the feasible segment for alpha_j inside the box [0, C] x [0, C]
    L = np.max((0., alpha_j - alpha_i))
    H = np.min((C, C + alpha_j - alpha_i))
    if y_i == y_j:
        L = np.max((0., alpha_i + alpha_j - C))
        H = np.min((C, alpha_i + alpha_j))
    return L, H


def compute_eta(x_i, x_j):
    # eta = 2 K(x_i, x_j) - K(x_i, x_i) - K(x_j, x_j), negative for distinct points
    return 2 * kernel(x_i, x_j) - kernel(x_i, x_i) - kernel(x_j, x_j)


def compute_E_k(f_x_k, y_k):
    # Prediction error E_k = f(x_k) - y_k
    return f_x_k - y_k


def clip_alpha_j(alpha_j, H, L):
    # Clip alpha_j to the segment [L, H]
    if alpha_j > H:
        return H
    if alpha_j < L:
        return L
    return alpha_j


def compute_alpha_j(alpha_j, E_i, E_j, y_j, eta):
    # Unconstrained optimum along the constraint line
    return alpha_j - (y_j * (E_i - E_j) / eta)


def compute_alpha_i(alpha_i, y_i, y_j, alpha_j, alpha_old_j):
    # alpha_i_new = alpha_i_old + y_i * y_j * (alpha_j_old - alpha_j_new)
    return alpha_i + y_i * y_j * (alpha_old_j - alpha_j)


def compute_b1(b, E_i, y_i, alpha_i, alpha_old_i, x_i, y_j, alpha_j, alpha_j_old, x_j):
    p1 = b - E_i - y_i * (alpha_i - alpha_old_i) * kernel(x_i, x_i)
    p2 = y_j * (alpha_j - alpha_j_old) * kernel(x_i, x_j)
    return p1 - p2


def compute_b2(b, E_j, y_i, alpha_i, alpha_old_i, x_i, x_j, y_j, alpha_j, alpha_j_old):
    p1 = b - E_j - y_i * (alpha_i - alpha_old_i) * kernel(x_i, x_j)
    p2 = y_j * (alpha_j - alpha_j_old) * kernel(x_j, x_j)
    return p1 - p2


def clip_b(alpha_i, alpha_j, b1, b2, C):
    # Choose b according to which multiplier lies strictly inside (0, C)
    if 0 < alpha_i < C:
        return b1
    if 0 < alpha_j < C:
        return b2
    return (b1 + b2) / 2


def select_j(i, m):
    # Randomly select an index j != i
    j = np.random.randint(m)
    while i == j:
        j = np.random.randint(m)
    return j


def smo(C, tol, max_passes, data_x, data_y):
    m, n = data_x.shape
    b, passes = 0., 0
    alphas = np.zeros(shape=(m))
    alphas_old = np.zeros(shape=(m))
    while passes < max_passes:
        num_changed_alphas = 0
        for i in range(m):
            x_i, y_i, alpha_i = data_x[i], data_y[i], alphas[i]
            f_x_i = f_x(data_x, data_y, alphas, x_i, b)
            E_i = compute_E_k(f_x_i, y_i)
            # Pick an alpha_i that violates the KKT conditions (within tolerance tol)
            if (y_i * E_i < -tol and alpha_i < C) or (y_i * E_i > tol and alpha_i > 0.):
                j = select_j(i, m)
                x_j, y_j, alpha_j = data_x[j], data_y[j], alphas[j]
                f_x_j = f_x(data_x, data_y, alphas, x_j, b)
                E_j = compute_E_k(f_x_j, y_j)
                alphas_old[i], alphas_old[j] = alpha_i, alpha_j
                L, H = compute_L_H(C, alpha_i, alpha_j, y_i, y_j)
                if L == H:
                    continue
                eta = compute_eta(x_i, x_j)
                if eta >= 0:
                    continue
                alpha_j = compute_alpha_j(alpha_j, E_i, E_j, y_j, eta)
                alpha_j = clip_alpha_j(alpha_j, H, L)
                if np.abs(alpha_j - alphas_old[j]) < 1e-5:
                    continue
                alphas[j] = alpha_j
                alpha_i = compute_alpha_i(alpha_i, y_i, y_j, alpha_j, alphas_old[j])
                b1 = compute_b1(b, E_i, y_i, alpha_i, alphas_old[i], x_i, y_j, alpha_j, alphas_old[j], x_j)
                b2 = compute_b2(b, E_j, y_i, alpha_i, alphas_old[i], x_i, x_j, y_j, alpha_j, alphas_old[j])
                b = clip_b(alpha_i, alpha_j, b1, b2, C)
                num_changed_alphas += 1
                alphas[i] = alpha_i

        if num_changed_alphas == 0:
            passes += 1
        else:
            passes = 0
    return alphas, b
```
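A quick way to exercise the implementation above (a usage sketch; the toy dataset and hyperparameters are arbitrary):

```python
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two linearly separable Gaussian blobs with labels in {-1, +1}
    X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
    X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
    data_x = np.vstack([X_pos, X_neg])
    data_y = np.hstack([np.ones(20), -np.ones(20)])

    alphas, b = smo(C=1.0, tol=1e-3, max_passes=10, data_x=data_x, data_y=data_y)
    w = compute_w(data_x, data_y, alphas)

    preds = np.sign(data_x @ w + b)
    print("training accuracy:", np.mean(preds == data_y))
    print("support vectors:", np.sum(alphas > 1e-6))
```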

Reference

Andrew Ng, Machine Learning, Stanford University