Support Vector Machine (SVM)
A supervised learning algorithm that finds the hyperplane with the maximum margin between classes in order to classify data points.
A supervised learning model for pattern recognition and data analysis, used mainly for classification and regression. Given a set of data each belonging to one of two categories, the SVM algorithm builds a non-probabilistic binary linear classifier that decides which category a new data point belongs to. The learned classifier is represented as a boundary in the space onto which the data are mapped, and the SVM algorithm finds, among all such boundaries, the one with the largest margin.
Hyperplane
Hyperplane
- A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p-1$
- e.g., in 2 dimensions a hyperplane is a line; in 3 dimensions it is a plane
- $b+w_{1}X_{1}+w_{2}X_{2}+\ldots+w_{p}X_{p}=b+\langle w,X\rangle=0$
- $b+\langle w,X\rangle=0$ means:
- Points on the hyperplane: $f(x)=0$
- Distance from a point $(a,b)$ to the line $\beta_{1}x+\beta_{2}y=\sigma$: $\frac{|\beta_{1}\cdot a+\beta_{2}\cdot b-\sigma|}{\sqrt{(\beta_{1})^{2}+(\beta_{2})^{2}}}$
- Points on one side of the hyperplane: $f(x)<0$; points on the other side: $f(x)>0$
- The normal vector $w=(w_{1},w_{2},\ldots,w_{p})$ points in the direction orthogonal to the hyperplane
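A small numpy sketch of the statements above (the vector `w`, the intercept `b`, and the sample points are invented for illustration): the sign of $f(x)=b+\langle w,x\rangle$ tells which side of the hyperplane a point lies on, and $|f(x)|/\|w\|$ is its distance to the hyperplane.

```python
import numpy as np

# Invented hyperplane f(x) = b + <w, x>; w is the normal vector.
w = np.array([2.0, -1.0])
b = 0.5

def f(x):
    return b + w @ x

for x in [np.array([1.0, 2.5]), np.array([-1.0, 0.0]), np.array([0.0, 0.5])]:
    side = np.sign(f(x))                  # 0 means the point lies exactly on the hyperplane
    dist = abs(f(x)) / np.linalg.norm(w)  # distance |f(x)| / ||w||
    print(x, side, round(dist, 3))
```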
Separating Hyperplane
- $f(X)=b+w_{1}X_{1}+w_{2}X_{2}+…+w_{p}X_{p}$
- Separates the points with $f(X)>0$ from the points with $f(X)<0$
- $Y_{i}\cdot f(X_{i}) >0~\textrm{for all}~i$
- $f(X)=0$ is the separating hyperplane
Maximal Margin Classifier
- Among all separating hyperplanes, find the one that maximizes the gap (margin) between the two classes
- The samples that actually constrain the hyperplane (the decision boundary) are called support vectors
- $\underset{b,w_{1},…,w_{p}}{\textrm{max}}M, \textrm{subject to}~M_{i}\geq M \textrm{ for all }i=1,…,N$
- ⇒ where each point's margin is $M_{i}={\frac{y_{i}(b+w_{1}x_{i1}+…+w_{p}x_{ip})}{||w||}}$
- ⇒ $\underset{w,b}{\textrm{max}}{\frac{\hat{M}}{||w||}}$, $\textrm{subject to }y_{i}(b+w_{1}x_{i1}+…+w_{p}x_{ip})\geq\hat{M}$, where $\hat{M}=M||w||$, for all $i=1,…,N$
- ⇒ $\underset{w,b}{\textrm{max}}\ \frac{1}{||w||}\ \textrm{subject to}\ y_{i}(b+w_{1}x_{i 1}+\ldots+w_{p}x_{ip})\geq 1,\ \textrm{for all}\ i=1,\ldots,N$
- $\therefore~{\underset{w,b}{\textrm{min}}\frac{||w||^{2}}{2}}$, $\textrm{subject to }y_{i}(b+w_{1}x_{i 1}+…+w_{p}x_{ip})\geq 1,\textrm{ for all }i=1,…,N$
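For a concrete view of this optimization, here is a hedged scikit-learn sketch: a linear `SVC` with a very large `C` approximates the hard-margin problem above (the toy points are made up, and `C=1e6` is just a stand-in for "no slack allowed").

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],        # class +1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```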
Lagrange Multiplier Method
A method that turns a constrained optimization problem into an unconstrained one by adding Lagrange multiplier terms for the constraints.
- Primal problem
- $\underset{x}{\textrm{min}}~c^{T}x, \textrm{ subject to }Ax=b, ~Gx \leq h$
- Introduce Lagrange multiplier vectors $u$ and $v \geq 0$ to build the Lagrange function $L$
- $L(x,u,v) = c^Tx+u^T(Ax-b)+v^T(Gx-h) \leq c^Tx$
Lagrange Multiplier Method
- $f^*\geq \underset{x}{\textrm{min}} ~L(x,u,v)= \underset{x}{\textrm{min}}~c^Tx+u^T(Ax-b)+v^T(Gx-h)=g(u,v)$
- $f^*$ = the optimal value of the primal objective $c^Tx$
- $g(u,v)$ is called the Lagrange dual function
- The minimum over $x$ is attained at the point where the partial derivatives are zero
- $\frac{\partial L}{\partial x}=c^T+u^TA+v^TG=0$
- $c=-A^Tu-G^Tv$
- $L(x,u,v)=-u^Tb-v^Th=g(u,v)$
Duality gap
- $f^* \geq \underset{x}{\textrm{min}} ~L(x,u,v)=g(u,v)$
- Maximizing $g(u,v)$ approaches the optimal value of the primal problem
- If a gap remains between $f^*$ and the maximum of $g(u,v)$, we have weak duality; if there is no gap, we have strong duality ($f^*=g(u,v)$)
Lagrange Dual problem
- The minimization problem is turned into a maximization problem (primal (min) ↔ dual (max))
- $\underset{u,v}{\textrm{max}}-u^Tb-v^Th, \textrm{ subject to }-A^Tu-G^Tv=c, v\geq0$
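To see this primal/dual pair in action, here is a minimal sketch using `scipy.optimize.linprog` on a toy LP (the matrices and vectors are invented); since an LP with a finite optimum satisfies strong duality, the two optima should coincide.

```python
# primal:  min c^T x   s.t.  A x = b,  G x <= h
# dual:    max -u^T b - v^T h   s.t.  -A^T u - G^T v = c,  v >= 0
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])
A, b = np.array([[1.0, 1.0]]), np.array([1.0])   # x1 + x2 = 1
G, h = -np.eye(2), np.zeros(2)                   # -x <= 0  (i.e. x >= 0)

# Primal problem
primal = linprog(c, A_ub=G, b_ub=h, A_eq=A, b_eq=b, bounds=[(None, None)] * 2)

# Dual problem: variables z = [u, v]; maximize -u^T b - v^T h == minimize u^T b + v^T h
c_dual = np.concatenate([b, h])
A_eq_dual = np.hstack([-A.T, -G.T])              # -A^T u - G^T v = c
dual = linprog(c_dual, A_eq=A_eq_dual, b_eq=c,
               bounds=[(None, None)] * len(b) + [(0, None)] * len(h))

print("primal optimum f* =", primal.fun)         # 1.0
print("dual optimum g*   =", -dual.fun)          # 1.0 -> zero duality gap (strong duality)
```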
Slater’s conditions
- $\underset{x}{\textrm{min}}f(x)$
- $\textrm{subject to }h_i(x)\le 0, ~i=1,…,m, ~~l_j(x)=0,~j=1,…,r$
- Condition 1: the primal problem is convex
- i.e., $f$ and $h_1,…,h_m$ are convex, $l_1,…,l_r$ are affine
- Condition 2: there exists at least one strictly feasible $x \in \mathbb{R}^n$
- i.e., $h_1(x)<0,…,h_m(x)<0~\textrm{and }l_1(x)=0,…,l_r(x)=0$
- If conditions 1 and 2 are satisfied, strong duality holds.
Karush-Kuhn-Tucker conditions (KKT conditions)
- For a problem with strong duality (when, e.g., Slater's conditions hold), the following equivalence is satisfied:
- $x^*$ and $u^*, v^*$ are primal and dual solutions
- ⇔ $x^*$ and $u^*, v^*$ satisfy the KKT conditions
- Stationarity condition
- $0 \in \partial (f(x)+\sum_{i=1}^{m}u_ih_i(x)+\sum_{j=1}^{r}v_jl_j(x))$
- Complementary slackness condition
- $u_i \cdot h_i(x)=0 \textrm{~for all}~i$
Optimization for Maximal Margin Classifier
Dual form
- $\underset{\alpha}{\textrm{max}}\underset{w,b}{\textrm{min}}L(w,b,\alpha)=\underset{\alpha}{\textrm{max}}\underset{w,b}{\textrm{min}}\frac{||w||^2}{2}-\sum_{i=1}^N\alpha_i(y_i(w\cdot x_i+b)-1)$
- $\textrm{subject to: }\alpha_i \geq 0 \textrm{ for all }i = 1,…,N$
- By the stationarity condition (in the KKT conditions),
- $\frac{\partial L(w,b,\alpha_i)}{\partial w} = 0,~ ~ ~ \frac{\partial L(w,b,\alpha_i)}{\partial b}=0$
- $\therefore w = \sum^N_{i=1}\alpha_iy_ix_i,~ \sum^N_{i=1}\alpha_iy_i = 0$
- $\underset{\alpha}{\textrm{max}}\sum^N_{i=1}\alpha_i - \frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\alpha_i \alpha_jy_iy_jx_i^Tx_j$
- $\textrm{subject to: }\alpha_i \ge 0,~\sum^N_{i=1}\alpha_iy_i=0,~\textrm{for all }i=1,\ldots,N$
- It can be solved by quadratic programming
- Compute $w, b$ from the $\alpha_i$
- $w = \sum^N_{i=1}\alpha_iy_ix_i, ~ ~b=\frac{1}{N_{SV}}\sum^{N_{SV}}_{i=1}(y_i-w^Tx_i)$
- By the complementary slackness condition (in the KKT conditions),
- $\alpha_i(y_i(w^Tx_i+b)-1)=0 \textrm{~ ~for all }i$
- Therefore, either $\alpha_i$ or $y_i(w^Tx_i+b)-1$ must be 0 for each $i$.
- The only observations that affect the decision boundary are the support vectors (those with $\alpha_i > 0$).
- Hence the name support vector machine.
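Below is a rough sketch of solving this dual on a tiny made-up dataset; a general-purpose solver (`scipy.optimize.minimize` with SLSQP) stands in for a dedicated quadratic programming solver. It also prints the complementary slackness products $\alpha_i(y_i(w^Tx_i+b)-1)$, which should all be approximately 0.

```python
# Dual:  max_a  sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j x_i^T x_j
#        s.t.   a_i >= 0,  sum_i a_i y_i = 0
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                                 # minimize the negative dual objective
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Recover w and b from the support vectors (alpha_i > 0)
sv = alpha > 1e-6
w = ((alpha * y)[:, None] * X).sum(axis=0)
b = np.mean(y[sv] - X[sv] @ w)
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", round(b, 4))
# Complementary slackness: alpha_i * (y_i (w^T x_i + b) - 1) ~ 0 for all i
print("slackness:", np.round(alpha * (y * (X @ w + b) - 1), 6))
```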
Soft Margin Machine
Non-separable data and Noisy data
- Cases that cannot be perfectly separated by a linear boundary
- Or cases that are separable, but where noisy samples lead to an inefficient boundary
- Solved by introducing slack variables $\epsilon_i$; the result is called the soft margin classifier
Soft Margin Machine
- $\underset{w,b,\epsilon}{\textrm{min}}\frac{||w||^2}{2}+C\sum^N_{i=1}\epsilon_i, ~\textrm{subject to: }y_i(b+w^Tx_i)\ge 1- \epsilon_i, ~\epsilon_i \ge 0, ~\textrm{for all } i = 1, \ldots, N$
- $\underset{\alpha,\beta}{\textrm{max}}~\underset{w,b,\epsilon}{\textrm{min}}\frac{||w||^2}{2}+C\sum^N_{i=1}\epsilon_i-\sum^N_{i=1}\alpha_i[y_i(w\cdot x_i+b)-1+\epsilon_i]-\sum^N_{i=1}\beta_i\epsilon_i$
- $\textrm{subject to: }\alpha_i \ge0, \ \beta_i \ge 0, \textrm{ for all }i=1,\ldots,N$
- By the stationarity conditions (in the KKT conditions), $\ \frac{\partial L}{\partial w} = 0, \ \ \frac{\partial L}{\partial b} = 0,\ \ \frac{\partial L}{\partial \epsilon_i} = 0$
$ \therefore w =\sum^N_{i=1}\alpha_i y_i x_i,~\sum^N_{i=1}\alpha_iy_i=0,~\alpha_i = C - \beta_i $
- $\underset{\alpha}{\textrm{max}}\sum^N_{i=1}\alpha_i - \frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\alpha_i \alpha_jy_iy_jx_i^Tx_j, ~\textrm{subject to: } 0 \le \alpha_i \le C,~\sum^N_{i=1}\alpha_iy_i=0, \textrm{ for all }i=1,\ldots,N$
- By the complementary slackness conditions, there are two kinds of support vectors
- Unbounded SV: samples with $0 < \alpha_i < C$, which lie on the margin
- Bounded SV: samples with $\alpha_i = C$ ($\epsilon_i \neq 0$), which lie inside the margin
- $w = \sum^N_{i=1}\alpha_iy_ix_i, ~~b = \frac{1}{N_{USV}} \sum^{N_{USV}}_{i=1}(y_i-w^Tx_i)$
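A hedged scikit-learn sketch of the two kinds of support vectors (the noisy data and `C=1.0` are invented): `SVC` stores $y_i\alpha_i$ for each support vector in `dual_coef_`, so $|\alpha_i| = C$ marks the bounded SVs and $0 < |\alpha_i| < C$ the unbounded ones.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (invented noisy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.2, size=(40, 2)),
               rng.normal(loc=[-2, -2], scale=1.2, size=(40, 2))])
y = np.array([1] * 40 + [-1] * 40)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
abs_alpha = np.abs(clf.dual_coef_[0])       # = alpha_i for each support vector

bounded = np.isclose(abs_alpha, C)          # alpha_i = C  -> inside the margin
unbounded = ~bounded                        # 0 < alpha_i < C -> on the margin
print("support vectors:", len(abs_alpha))
print("unbounded (on margin):", unbounded.sum(), " bounded (in margin):", bounded.sum())
```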
Hyper-parameter $C$ in SVM
- Increasing the value of $C$
- The slack variables $\epsilon_i$ are pushed toward 0 because their penalty grows, so the margin width decreases
- The number of support vectors decreases, so the decision boundary is determined by fewer samples
- Variance increases, bias decreases
- Decreasing the value of $C$
- The margin width increases
- Eventually every sample becomes a support vector
- Bias increases, variance decreases
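A quick scikit-learn sketch of this trade-off (overlapping toy data, arbitrary grid of C values): as C grows, the number of support vectors shrinks and the margin width $2/\|w\|$ narrows.

```python
import numpy as np
from sklearn.svm import SVC

# Invented overlapping data so that the soft margin actually matters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (50, 2)),
               rng.normal([-1.5, -1.5], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_[0])    # margin width 2/||w||
    print(f"C={C:<6} #SV={clf.n_support_.sum():<3} margin width={width:.3f}")
```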
Classification of new data with SVM
- Up to now, the process has been finding the hyperplane of the decision boundary from the given data; a new observation is classified using
- $f(x) = w^Tx + b = \sum_{i\in S} \alpha_i y_i x_i^T x + b$
- If the predicted value is greater than 0, classify as $\hat{y}=1$
- If the predicted value is less than 0, classify as $\hat{y}=-1$
- Few computations are required, since the sum runs only over the support vectors $i \in S$
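A small sketch checking this decision rule against scikit-learn (toy data and an arbitrary query point): `SVC` exposes $\alpha_i y_i$ as `dual_coef_`, the support vectors as `support_vectors_`, and $b$ as `intercept_`, so $f(x)$ can be recomputed by hand.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2, 2], 1.0, (30, 2)), rng.normal([-2, -2], 1.0, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.7, -0.3])                 # arbitrary query point
# f(x) = sum_{i in S} alpha_i y_i x_i^T x + b, using only the support vectors
f_manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print("manual f(x):", f_manual)
print("sklearn    :", clf.decision_function(x_new[None, :])[0])
print("prediction y_hat:", 1 if f_manual > 0 else -1)
```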
Non-linear SVM
Mapping function
- Map data with a nonlinear structure to a higher-dimensional space and classify it in that space
- A mapping $(\phi)$ that moves data from the input space to a feature space
Kernel trick
- Introducing a mapping function greatly increases the amount of computation $(\phi(x_i)^T\phi(x_j))$
- A kernel is introduced to handle the high-dimensional mapping and the inner product in a single step
- Example: $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)=(Ax_i)^T(Ax_j)=x_i^TA^TAx_j$ (a mapping from dimension $m$ to $n$)
- Mercer’s Theorem
- The matrix of values $K(x_i,x_j)$ must always be positive semi-definite (its eigenvalues are $\geq 0$)
- Positive semi-definite matrix condition
- $K(x_i,x_j)=K(x_j,x_i)$
- Symmetric matrix condition
- If both conditions are satisfied, there exists some $\phi$ such that $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)$
- Any function that satisfies these conditions can be used as a kernel function
- Linear : $K(x_i,x_j) = x_i^Tx_j$
- Polynomial : $K(x_i,x_j)=(x_i^Tx_j+c)^d$
- Gaussian : $K(x_i,x_j) = \exp\left(-\frac{\|x_i-x_j\|^2_{2}}{2\sigma^2}\right)$
- Radial : $K(x_i,x_j) = \exp\left(-\gamma\sum^p_{k=1}(x_{ik}-x_{jk})^2\right)$
- Example: $K(x_i,x_j) = (x_i^Tx_j)^2$
- $K(\langle x_1,x_2\rangle,\langle z_1,z_2\rangle) = (x\cdot z)^2=(x_1z_1+x_2z_2)^2 = \langle x^2_1,\sqrt{2}x_1x_2,x^2_2\rangle \cdot \langle z_1^2,\sqrt{2}z_1z_2,z^2_2\rangle = \phi(x_1,x_2)\cdot\phi(z_1,z_2)$
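A short numpy check of this example (the vectors `x` and `z` are arbitrary): the kernel value $(x^Tz)^2$ matches the inner product of the explicit feature maps $\phi$.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the kernel (x^T z)^2 in 2 dimensions.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel value: (x . z)^2
print(phi(x) @ phi(z))     # explicit mapping, then inner product -> same value
```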
SVM with Kernel trick
- $\underset{\alpha}{\textrm{max}}\sum^N_{i=1}\alpha_i-\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\alpha_i\alpha_jy_iy_jK(x_i,x_j), ~\textrm{subject to: }\alpha_i \ge 0, ~\sum^N_{i=1}\alpha_iy_i=0,~\textrm{for all }i=1,\ldots,N$
- $w=\sum^N_{i=1}\alpha_iy_i\phi(x_i)$, $~b=\frac{1}{N_{SV}}\sum_{i=1}^{N_{SV}}\left(y_i-\sum^{N_{SV}}_{j=1}\alpha_jy_jK(x_i,x_j)\right)$
- Only the kernel function is needed for classification, so the mapping function never has to be computed explicitly
- $f(\phi(x))=w^T\phi(x)+b=\sum_{i\in S}\alpha_iy_i\phi(x_i)^T\phi(x)+b=\sum_{i \in S}\alpha_iy_iK(x_i,x) +b$
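A hedged scikit-learn sketch of the kernel SVM on non-linearly separable toy data (`make_moons`; the `gamma` and `C` values are arbitrary): the RBF/Gaussian kernel separates what a linear boundary cannot, without ever computing $\phi(x)$ explicitly.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# Training accuracy, just to illustrate the gap between the two boundaries.
print("linear kernel accuracy:", linear.score(X, y))
print("RBF kernel accuracy   :", rbf.score(X, y))
```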
Multiclass SVM
- OVA (One versus All)
    - Train $K$ two-class SVMs, each separating one class from all the others
    - Classify to the class with the largest value of $\hat{f}_k(x)$
- OVO (One versus One)
    - Train $\binom{K}{2}$ pairwise classifiers $(\hat{f}_{kl}(x))$
    - Classify to the class that wins the most pairwise competitions
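A short scikit-learn sketch of both strategies on the iris data (the linear kernel is an arbitrary choice): `OneVsRestClassifier` implements OVA with $K$ classifiers, while `OneVsOneClassifier` implements OVO with $\binom{K}{2}$.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                           # K = 3 classes
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # K*(K-1)/2 classifiers
print("OVA estimators:", len(ova.estimators_), " accuracy:", ova.score(X, y))
print("OVO estimators:", len(ovo.estimators_), " accuracy:", ovo.score(X, y))
```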
SVM vs. Logistic Regression
- If the classes are (nearly) separable, SVM > LR
- Otherwise, LR (with a ridge penalty) ≈ SVM
- If probability estimates are needed, use LR
- For non-linear boundaries, the kernel SVM is the more convenient choice computationally