
The process of tuning model complexity to find the optimal balance between under-fitting (high bias) and over-fitting (high variance).

High bias means the model is too simple and misreads the data, while high variance means the model is too complex and over-fits to the particular data it was trained on. The trade-off is about managing bias and variance appropriately.

This document covers the Bias-Variance Trade-off and explains the sources of model error from a theoretical point of view. For Over / Under-fitting from the empirical point of view of evaluating actual model performance, see the page below.

Over and Under-fitting

Complexity of model



  • Complexity increases with the number of model parameters and as the model moves from a linear to a non-linear model.
  • The more complex the model, the more perfectly it learns the training data (see the sketch below).
  • Possible errors depending on the amount of training data:
    • When training data is plentiful: Under-fitting (the decision boundary is overly linear)
    • When training data is scarce: Over-fitting
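
A minimal sketch of this, assuming NumPy and a toy sine data-set (the polynomial degrees here are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small, noisy training set.
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Higher polynomial degree = more parameters = higher model complexity.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)      # fit the training data
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    print(f"degree={degree}  training MSE = {train_mse:.4f}")

# The training MSE keeps shrinking as the degree grows: the more complex
# the model, the more perfectly it learns the training data.
```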



    What is the difference between Over- and Under-fitting?



Over-fitted classification and regression models memorize the training data too well in comparison with correctly fitted models.


Over-fitting



Over-fitting is a machine learning behavior that occurs when the model is so closely aligned to the training data that it does not know how to respond to new data.


Because,

  • The machine learning model is too complex; it memorizes very subtle patterns in the training data that don’t generalize well.
  • The training data size is too small for the model complexity and/or contains large amounts of irrelevant information.


So,

You can prevent over-fitting by managing model complexity and improving the training data set.

When only looking at the computed error of a machine learning model for the training data, over-fitting is harder to detect than under-fitting. So, to avoid over-fitting, it is important to validate a machine learning model before using it on test data.

Error      Over-fitting   Right Fit   Under-fitting
Training   Low            Low         High
Test       High           Low         High


Computed error of over-fitted models for training data is low, whereas the error is high for test data.
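
A minimal sketch of this check, assuming scikit-learn and a synthetic data-set (the degree-15 polynomial is deliberately over-complex):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An intentionally over-complex model: degree-15 polynomial on 20 training points.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_err = mean_squared_error(y_train, model.predict(X_train))
test_err = mean_squared_error(y_test, model.predict(X_test))
print(f"training error: {train_err:.4f}")   # low
print(f"test error:     {test_err:.4f}")    # much higher -> over-fitting
```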



So, how do we fix it?

The fundamental problem behind over-fitting is that the model has been given too much freedom.

So we can apply regularization, which penalizes the model in proportion to its complexity.

The error function being optimized is replaced with a new function that includes regularization, as shown below. The added term $p$ is called the penalty term.

$E^{r}(w) = E(w) + p$

Now the model works to reduce both the error and the penalty term.

The characteristics of the regularization depend on which penalty term is used.
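
A minimal sketch of the idea in NumPy (assuming a linear model; the L2 penalty below is just one possible choice of $p$, and the sections further down cover L1 and L2 in detail):

```python
import numpy as np

def error(w, X, y):
    """Plain error E(w): squared error of the linear model X @ w."""
    return np.sum((y - X @ w) ** 2)

def regularized_error(w, X, y, lam=1.0):
    """E^r(w) = E(w) + p, with an L2 penalty chosen as the penalty term p."""
    p = lam * np.sum(w ** 2)
    return error(w, X, y) + p
```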


In machine learning,

Regularization keeps parameters from taking excessively large values.

  • Regularization: a parameter shrinkage method.
    • Lasso (Least Absolute Shrinkage and Selection Operator)
    • Ridge regression



Under-fitting

Because,

  • The model’s complexity is too low.
  • The model was trained with garbage (irrelevant or noisy) data.

So,

You can change the input data’s features, or increase the model’s complexity compared to before.




Inductive learning




Bias and Variance Trade-off

  • Bias: related to under-fitting; the gap between the mean of the model’s predictions and the real (optimal) parameter, i.e., how accurate the model is on average.
  • Variance: related to over-fitting; how much the model’s predictions spread out when it is trained on different data-sets.

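A minimal sketch of how these two quantities can be estimated empirically (assuming NumPy, a toy sine function as the true relationship, and a polynomial model):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.3        # a fixed test input
degree = 9      # a deliberately complex polynomial model

# Train the same kind of model on many training sets drawn from the same
# distribution and record its prediction at x0.
predictions = []
for _ in range(200):
    x = rng.uniform(0, 1, 15)
    y = true_f(x) + rng.normal(scale=0.3, size=x.shape)
    coeffs = np.polyfit(x, y, degree)
    predictions.append(np.polyval(coeffs, x0))

predictions = np.array(predictions)
bias = predictions.mean() - true_f(x0)   # mean prediction vs. the true value
variance = predictions.var()             # spread of predictions across data-sets
print(f"bias^2   = {bias ** 2:.4f}")
print(f"variance = {variance:.4f}")
```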


How do we handle this trade-off?

  • Raise the model’s complexity (to lower bias)
  • Prevent over-fitting (to lower variance)
    • Use verified data-sets
    • K-fold cross validation (see the sketch below)
    • Regularized loss function
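
A minimal sketch of K-fold cross validation with scikit-learn (the Ridge model and synthetic data-set here are only placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# 5-fold cross validation: each sample is used for validation exactly once,
# which gives a more reliable performance estimate than a single split.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("mean R^2:", scores.mean(), "+/-", scores.std())
```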



Regularized loss function

  • Higher model complexity goes together with a larger number of model parameters.
  • If the model’s complexity is too high, it will lead to over-fitting.
  • So, when the model’s complexity is high, learn only the parameters that are significant for the data-set.
  • In other words, drive unnecessary parameters to 0.



Kinds of Regularization

Regularization: repositions $\hat{\beta}$ toward (0,0).

Sparsity of parameters: Lasso (L1) > Ridge (L2)

Loss Function


Lasso (L1) Regression

  • $L=\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{D}\beta_{j}x_{ij}\right)\right)^2+\lambda\sum_{j=1}^{D}\left|\beta_{j}\right|$
  • If the MSE loss cannot be reduced, the loss value of the penalty term has a larger effect.
  • $\lambda$ (lambda) is the parameter that controls the strength of the regularization (like $w$ in the loss function).
  • The regularization term is expressed as a sum of absolute values.

$\hat{\beta}$ (optimum) value → pulled toward (0,0)
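
A minimal sketch with scikit-learn’s Lasso on a synthetic data-set (the alpha argument corresponds to $\lambda$ above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 5 of them actually carry information.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)   # alpha plays the role of lambda above
lasso.fit(X, y)

# The L1 penalty drives the coefficients of unnecessary features to exactly 0.
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```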


Ridge (L2) Regression

  • $L=\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{D}\beta_{j}x_{ij}\right)\right)^2+\lambda\sum_{j=1}^{D}\beta_{j}^2$
  • If the MSE loss cannot be reduced, the loss value of the penalty term has a larger effect.
  • $\lambda$ (lambda) is the parameter that controls the strength of the regularization (like $w$ in the loss function).
  • The regularization term is expressed as a sum of squares.

$\hat{\beta}$ (optimum) value → pulled toward (0,0)
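
The same sketch with Ridge, for comparison with the Lasso example above (same synthetic data-set, alpha again standing in for $\lambda$):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0)   # alpha corresponds to lambda above
ridge.fit(X, y)

# The L2 penalty shrinks coefficients toward 0 but rarely makes them exactly 0,
# which is why Lasso (L1) yields sparser parameters than Ridge (L2).
print("non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
```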
