Bias and Variance Trade-off and Loss Function
The process of adjusting model complexity to find the best balance between under-fitting (bias) and over-fitting (variance).
High bias means the model is too simple and misreads the data, while high variance means the model is too complex and over-fits particular data. The topic here is managing bias and variance appropriately.
This document covers the Bias-Variance Trade-off, explaining the sources of model error from a theoretical point of view. For Over / Under-fitting from the empirical point of view of evaluating actual model performance, see the page below.
Complexity of a model
- Complexity increases with the number of model parameters and as the model moves from linear to non-linear.
- The more complex the model, the more closely it fits the training data.
- Possible errors depending on the amount of training data (sketched below):
  - When training data is plentiful: Under-fitting (the decision boundary stays overly linear)
  - When training data is scarce: Over-fitting
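A minimal sketch of this effect, using an assumed synthetic data set and scikit-learn's `PolynomialFeatures` as the complexity knob: a low-degree model under-fits, while a very high-degree model fitted on the small training set over-fits.

```python
# Sketch: polynomial degree as model complexity (synthetic data, values chosen for illustration).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)             # small training set
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)    # noisy non-linear target

X_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 4, 15):   # too simple, roughly right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```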
How are Over-fitting and Under-fitting different?
Over-fitted classification and regression models memorize the training data too well in comparison with correctly fitted models.
Over-fitting
Over-fitting is a machine learning behavior that occurs when the model is so closely aligned to the training data that it does not know how to respond to new data.
Because,
- The machine learning model is too complex; It memorizes very subtle patterns in the training data that don’t generalize well.
- The training data size is too small for the model complexity and/or contains large amounts of irrelevant information.
So,
You can prevent over-fitting by managing model complexity and improving the training data set.
When only looking at the computed error of a machine learning model for the training data, over-fitting is harder to detect than under-fitting. So, to avoid over-fitting, it is important to validate a machine learning model before using it on test data.
| Error | Overfitting | Right Fit | Underfitting |
|---|---|---|---|
| Training | Low | Low | High |
| Test | High | Low | High |
Computed error of over-fitted models for training data is low, whereas the error is high for test data.
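A minimal sketch of the table above, on an assumed synthetic data set: hold out part of the data, fit a flexible model and a constrained one, and read over-fitting from the gap between training and held-out error.

```python
# Sketch: detect over-fitting from the train/validation error gap (placeholder data and models).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 1))
y = X.ravel() ** 3 + rng.normal(0, 0.1, 200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)                   # free to memorize
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)   # constrained

for name, m in (("deep tree", deep), ("shallow tree", shallow)):
    tr = mean_squared_error(y_train, m.predict(X_train))
    va = mean_squared_error(y_val, m.predict(X_val))
    print(f"{name:12s} train MSE={tr:.4f}  val MSE={va:.4f}")   # low train + high val => over-fit
```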
And, how do we deal with it?
The fundamental problem behind over-fitting is that the model has been given too much freedom.
So we can apply regularization, which penalizes the model in proportion to its complexity.
The error function being optimized is replaced with a new function that has regularization applied, as below. The added term $p$ is called the penalty term.
$E^r(w) = E(w) + p$
Now the model works to reduce both the error and the penalty term.
The characteristics of the regularization change depending on which penalty term is used.
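A minimal sketch of $E^r(w)=E(w)+p$ in plain NumPy, taking $E(w)$ to be the MSE and assuming an L2 penalty for $p$; the function and data below are illustrative only.

```python
# Sketch: a regularized error function E^r(w) = E(w) + p (assumed L2 penalty for p).
import numpy as np

def regularized_error(w, X, y, lam=0.1):
    E = np.mean((y - X @ w) ** 2)   # E(w): the ordinary error term (MSE)
    p = lam * np.sum(w ** 2)        # p: the penalty term, larger for larger parameters
    return E + p                    # E^r(w) = E(w) + p

# The same kind of fit is penalized more when the weights are larger.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 2.0])
print(regularized_error(np.array([1.0, 1.0]), X, y))    # small weights, small penalty
print(regularized_error(np.array([5.0, -3.0]), X, y))   # large weights, large penalty
```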
In machine learning,
Regularization prevents parameters from taking excessively large values.
- Regularization: a parameter shrinkage method.
- Lasso (Least Absolute Shrinkage and Selection Operator)
- Ridge regression
Under-fitting
Because,
- The model's complexity is too low.
- The model was trained with garbage data.
So,
You can change the input data's features, or increase the model's complexity beyond what it was before.
Inductive learning
Bias and Variance Trade-off
- Bias: relates to under-fitting; the gap between the mean of the model's predictions and the real (optimal) parameter, i.e., model accuracy.
- Variance: relates to over-fitting; how much the model's predictions spread around their mean across different training sets, i.e., model consistency.
How do we resolve the trade-off?
- Raise the model complexity
- Prevent over-fitting
- Use validated data sets
- K-fold cross validation (see the sketch below)
- Regularized loss function
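A minimal sketch of the K-fold cross validation item above, using scikit-learn's `cross_val_score`; the data set and estimator are placeholder assumptions.

```python
# Sketch: K-fold cross validation (K = 5) to estimate generalization error.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)   # placeholder data set
model = Ridge(alpha=1.0)                 # placeholder estimator

# Each fold is held out once for validation while the rest is used for training.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("per-fold MSE:", -scores)
print("mean MSE    :", -scores.mean())
```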
Regularized loss function
- Higher model complexity goes hand in hand with more model parameters
- If the model's complexity is too high, it tends to lead to over-fitting
- So, when the model's complexity is fairly high, learn only the parameters that are significant in the data set
- In other words, set the unnecessary parameters to 0
Kinds of Regularization
Regularization: repositions $\hat{\beta}$ toward $(0,0)$
Sparsity of parameters: Lasso (L1) > Ridge (L2)
Lasso (L1) Regression
- $L=\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{D}\beta_{j}x_{ij}\right)\right)^2+\lambda\sum_{j=1}^{D}\left|\beta_{j}\right|$
- If a coefficient fails to reduce the MSE loss, the loss value of the penalty term has the larger effect
- $\lambda$ (lambda) is the parameter that controls the strength of regularization (just as $w$ is the parameter of the loss function)
- The regularization term is expressed as the sum of the absolute values of the coefficients
$\hat{\beta}$ (optimum) value → pulled toward $(0,0)$
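A minimal sketch of the Lasso objective above with scikit-learn, where `alpha` plays the role of $\lambda$ and the synthetic data is an assumption; the point is that the L1 penalty drives the coefficients of unhelpful features exactly to 0.

```python
# Sketch: Lasso (L1) - the penalty lambda * sum(|beta_j|) zeroes out unhelpful coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)   # only 2 of 5 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(lasso.coef_, 3))   # irrelevant features end up at 0.0
```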
Ridge (L2) Regression
- $L=\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{D}\beta_{j}x_{ij}\right)\right)^2+\lambda\sum_{j=1}^{D}\beta_{j}^2$
- If a coefficient fails to reduce the MSE loss, the loss value of the penalty term has the larger effect
- $\lambda$ (lambda) is the parameter that controls the strength of regularization (just as $w$ is the parameter of the loss function)
- The regularization term is expressed as the sum of the squares of the coefficients
$\hat{\beta}$ (optimum) value → pulled toward $(0,0)$
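The same sketch with Ridge: the L2 penalty shrinks all coefficients toward 0 but, unlike Lasso, usually does not make any of them exactly 0. The synthetic data and `alpha` value are assumptions.

```python
# Sketch: Ridge (L2) - the penalty lambda * sum(beta_j^2) shrinks coefficients without zeroing them.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)   # same assumed data as the Lasso sketch

ridge = Ridge(alpha=1.0).fit(X, y)
print("coefficients:", np.round(ridge.coef_, 3))   # small but typically non-zero values everywhere
```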