Boosting

October 4, 2023 2 minute read

약한 학습기(weak learners)를 순차적으로 훈련, 이전 모델의 오류를 보완하면서 점진적으로 성능을 향상시키는 앙상블 학습 기법

이를 통해 bias 와 variance 를 동시에 줄이려고 시도한다.

Intro

Bagging 과 마찬가지로, 다양한 algorithm 과 regression, classification 문제 모두 적용 가능하다.
Decision tree 를 사용한 boosting algorithm
- AdaBoost
- Gradient Boosting (GBM)
- XGBoost
- Light GBM
이전 Step 의 tree information 을 활용해 sequentially tree를 만든다.

Number of trees B
- Decision tree 를 sequentially 생성 할 때, 생성 갯수를 결정
- B 값이 커질수록, Over-fitting 문제 발생
Number of split
- each decision tree 가 어느 정도의 depth 를 가지는지 결정
- Split 이 한 번인 (=2 terminal node) decision tree 를 특별히 stump 로 부른다.

The first boosting algorithm
이전 decision tree 가 잘못 predict 한 data 에 큰 weight ($w_i$)를 부여해, 다음 결정 트리가 더 집중할 수 있도록 순차적으로 학습
Decision tree 로는 stump 구조 사용
B 개의 decision tree 별로 계산된 model weight ($c_b$)를 합산해 최종 model 생성
b 번째 반복에서의 모델은 다음처럼 decision tree 의 linear combination

Loss function 은 sum of exponential loss 으로 정의. $(w^1_i=1,$ ${w^b_i=e^{-y_i\hat f_{b-1}(x_i)}}$ $\textrm{ assumption})$

$E =\sum_{y_i=\hat f_b(x_i)}w^b_ie^{-c_b} +\sum_{y_i\neq\hat f_b(x_i)}w^b_ie^{c_b}=\sum^N_{i=1}w^b_ie^{-c_b}+\sum_{y_i\neq\hat f_b(x_i)}w^b_i(e^{c_b}-e^{-c_b})$
E 를 minimize 하는 $\hat f_b$ 는 $\sum_{y_i\neq\hat f_{b}(x_i)}w^b_i$ 를 minimize 하는 model 임으로, 잘못 predict 한 data 의 weight 를 고려한 decision tree 가 학습된다.
Learning to model weight $c_b$

Next step 의 data weight : ${w_i^{b+1}}$ $=e^{-y_i\hat f_b(x_i)}=e^{-y_i(\hat f_{b-1}(x_i)+c_b \hat f_b(x_i))}$
$ = ${w^b_i \cdot \exp(-y_ic_b\hat f_b(x_i))}$
- 이를 B번 반복하여 최종 모델을 생성

Initial data weight : $w^1_i=1$
Calculate error of decision tree : $\epsilon_b =(\sum_{y_i\neq\hat f_b(x_i)}w^b_i)/(\sum^N_{i=1}w^b_i)$
Calculate weight of decision tree : $c_b=\frac12\log((1-\epsilon_b)/\epsilon_b)$
Update weight of data : $w^b_i=w^{b-1}_i \cdot\exp(-c_by_i\hat f_b(x_i))$
Reiterate for number of B to step 2-4
Create an end result model : ${\hat f(x)=\textrm{sign}(\sum^B_{b-1}c_b\cdot\hat f_b(x))}$

현재 모델의 residual(오차)를 줄여주는 방향으로 decision tree 를 학습하는 방법론
첫 번째 decision tree 는 하나의 leaf node structure ( =Predict by entirely data’s average)
이후에는 일반적으로 stump 보다는 더 complicated tree structure 를 사용한다.
Loss function 으로는 보통 differentiable MSE loss, L1 loss or Logistic loss 를 사용한다.
Residual 은 real value 와 predicted value 의 차이 ($y_i-\hat f(x_i)$)로, negative gradient 와 같은 의미

$\frac{\partial L}{\partial \hat f(x_i)}=\frac{\partial(y_i-\hat f(x_i))^2}{\partial \hat f(x_i)}=\hat f(x_i)-y_i$
Defined loss function 에 대한 negative gradient 로 residual 계산
이전 모델의 residual 을 minimize 하는 decision tree $\gamma$ 학습 ( j = index of terminal node )

$\gamma^b_j=\arg \underset{\gamma}{\min} \sum_{x_i \in R^b_j}L(y_i,\hat f_{b-1}(x_i)+\gamma)$
학습한 decision tree 를 그대로 합치면 over-fitting 문제가 발생할 수 있음
Therefore, Parameter of learning rate, $v$ 를 도입함

$\hat f_b(x_i)=\hat f_{b-1}(x_i)+v\sum^{Jb}_{j=1}\gamma^b_j \ \mathbb{I}\ (x \in R^b_j)$
Gradient Descent algorithm 과 동일하다.

GBM algorithm 의 performance 와 speed 부분에서 향상된 algorithm
기존 GBM 은 learning data 에 대한 residual 을 계속 줄여 over-fitting 되기 쉬움
Add normalization term to loss function
- $\Omega(f)= \gamma^T+\frac12 \lambda||c||^2 \textrm{(T = number of terminal node, c = weight of each nodes) }$
Split finding algorithm 을 통해 efficient of calculating 를 높임
- 기존에는 모든 feature 를 split 기준으로 탐색했었음
- 이에 대한 approximation algorithm 을 제안해 속도를 향상시킴

기존의 boosting algorithm 은 B번의 반복 학습 때 마다 전체 data sets 를 살펴본다.
이 과정에서 대부분의 calculation cost 가 발생함
결정 트리 학습에 사용되는 데이터 수를 다음의 방법들로 줄임
- GOSS (Gradient-based One-Side Sampling)
  - 작은 Gradient 값을 가진 샘플들을 제외하는 방법론]
- EFB (Exclusive Feature Bundling)
  - Mutually exclusive (상호 배타적) feature 를 묶어, 탐색해야 하는 feature 수를 감소시킴