
The process of tuning model complexity to find the optimal balance between under-fitting (high bias) and over-fitting (high variance).

High bias means the model is too simple and misreads the data, while high variance means the model is too complex and over-fits to the particular data it was trained on. The trade-off is about managing bias and variance appropriately.

This document covers the Bias-Variance Trade-off and explains the sources of model error from a theoretical point of view. For Over / Under-fitting from the empirical point of view of evaluating actual model performance, see the page below.

Over and Under-fitting

Complexity of model



  • Complexity increases with the number of model parameters and as the model moves from a linear to a non-linear model.
  • The more complex the model, the more perfectly it learns the training data (see the sketch below).
  • Possible errors depending on the amount of training data:
    • When training data is plentiful: Under-fitting (the decision boundary is overly linear)
    • When training data is scarce: Over-fitting
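
A minimal sketch of this, assuming NumPy and a toy sine data-set (the polynomial degrees here are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small, noisy training set.
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Higher polynomial degree = more parameters = higher model complexity.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)      # fit the training data
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    print(f"degree={degree}  training MSE = {train_mse:.4f}")

# The training MSE keeps shrinking as the degree grows: the more complex
# the model, the more perfectly it learns the training data.
```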



    What is the difference between Over- and Under-fitting?



Over-fitted classification and regression models memorize the training data too well in comparison with correctly fitted models.


Over-fitting



Over-fitting is a machine learning behavior that occurs when the model is so closely aligned to the training data that it does not know how to respond to new data.


Because,

  • The machine learning model is too complex; it memorizes very subtle patterns in the training data that don’t generalize well.
  • The training data size is too small for the model complexity and/or contains large amounts of irrelevant information.


So,

You can prevent over-fitting by managing model complexity and improving the training data set.

When only looking at the computed error of a machine learning model for the training data, over-fitting is harder to detect than under-fitting. So, to avoid over-fitting, it is important to validate a machine learning model before using it on test data.

Error      Over-fitting   Right Fit   Under-fitting
Training   Low            Low         High
Test       High           Low         High


Computed error of over-fitted models for training data is low, whereas the error is high for test data.
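
A minimal sketch of this check, assuming scikit-learn and a synthetic data-set (the degree-15 polynomial is deliberately over-complex):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An intentionally over-complex model: degree-15 polynomial on 20 training points.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_err = mean_squared_error(y_train, model.predict(X_train))
test_err = mean_squared_error(y_test, model.predict(X_test))
print(f"training error: {train_err:.4f}")   # low
print(f"test error:     {test_err:.4f}")    # much higher -> over-fitting
```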



So, how do we fix it?

The fundamental problem behind over-fitting is that the model has been given too much freedom.

So we can apply regularization, which penalizes the model in proportion to its complexity.

The error function being optimized is replaced with a new function that includes regularization, as shown below. The added term $p$ is called the penalty term.

$E^{r}(w) = E(w) + p$

Now the model works to reduce both the error and the penalty term.

The characteristics of the regularization depend on which penalty term is used.
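
A minimal sketch of the idea in NumPy (assuming a linear model; the L2 penalty below is just one possible choice of $p$, and the sections further down cover L1 and L2 in detail):

```python
import numpy as np

def error(w, X, y):
    """Plain error E(w): squared error of the linear model X @ w."""
    return np.sum((y - X @ w) ** 2)

def regularized_error(w, X, y, lam=1.0):
    """E^r(w) = E(w) + p, with an L2 penalty chosen as the penalty term p."""
    p = lam * np.sum(w ** 2)
    return error(w, X, y) + p
```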


In machine learning,

Regularization keeps parameters from taking excessively large values.

  • Regularization: a parameter shrinkage method.
    • Lasso (Least Absolute Shrinkage and Selection Operator)
    • Ridge regression



Under-fitting

Because,

  • The model’s complexity is too low.
  • The model was trained with garbage (irrelevant or noisy) data.

So,

You can change the input data’s features, or increase the model’s complexity compared to before.




Inductive learning




Bias and Variance Trade-off

  • Bias: related to under-fitting; the gap between the mean of the model’s predictions and the real (optimal) parameter, i.e., how accurate the model is on average.
  • Variance: related to over-fitting; how much the model’s predictions spread out when it is trained on different data-sets.

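A minimal sketch of how these two quantities can be estimated empirically (assuming NumPy, a toy sine function as the true relationship, and a polynomial model):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.3        # a fixed test input
degree = 9      # a deliberately complex polynomial model

# Train the same kind of model on many training sets drawn from the same
# distribution and record its prediction at x0.
predictions = []
for _ in range(200):
    x = rng.uniform(0, 1, 15)
    y = true_f(x) + rng.normal(scale=0.3, size=x.shape)
    coeffs = np.polyfit(x, y, degree)
    predictions.append(np.polyval(coeffs, x0))

predictions = np.array(predictions)
bias = predictions.mean() - true_f(x0)   # mean prediction vs. the true value
variance = predictions.var()             # spread of predictions across data-sets
print(f"bias^2   = {bias ** 2:.4f}")
print(f"variance = {variance:.4f}")
```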


How do we handle this trade-off?

  • Raise the model’s complexity (to lower bias)
  • Prevent over-fitting (to lower variance)
    • Use verified data-sets
    • K-fold cross validation (see the sketch below)
    • Regularized loss function
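
A minimal sketch of K-fold cross validation with scikit-learn (the Ridge model and synthetic data-set here are only placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# 5-fold cross validation: each sample is used for validation exactly once,
# which gives a more reliable performance estimate than a single split.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("mean R^2:", scores.mean(), "+/-", scores.std())
```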



Regularized loss function

  • Higher model complexity goes together with a larger number of model parameters.
  • If the model’s complexity is too high, it will lead to over-fitting.
  • So, when the model’s complexity is high, learn only the parameters that are significant for the data-set.
  • In other words, drive unnecessary parameters to 0.



Kinds of Regularization

Regularization: repositions $\hat{\beta}$ toward (0,0).

Sparsity of parameters: Lasso (L1) > Ridge (L2)

Loss Function


Lasso (L1) Regression

  • $L=\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{D}\beta_{j}x_{ij}\right)\right)^2+\lambda\sum_{j=1}^{D}\left|\beta_{j}\right|$
  • If the MSE loss cannot be reduced, the loss value of the penalty term has a larger effect.
  • $\lambda$ (lambda) is the parameter that controls the strength of the regularization (like $w$ in the loss function).
  • The regularization term is expressed as a sum of absolute values.

$\hat{\beta}$ (optimum) value → pulled toward (0,0)
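
A minimal sketch with scikit-learn’s Lasso on a synthetic data-set (the alpha argument corresponds to $\lambda$ above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 5 of them actually carry information.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)   # alpha plays the role of lambda above
lasso.fit(X, y)

# The L1 penalty drives the coefficients of unnecessary features to exactly 0.
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```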


Ridge (L2) Regression

  • $L=\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{D}\beta_{j}x_{ij}\right)\right)^2+\lambda\sum_{j=1}^{D}\beta_{j}^2$
  • If the MSE loss cannot be reduced, the loss value of the penalty term has a larger effect.
  • $\lambda$ (lambda) is the parameter that controls the strength of the regularization (like $w$ in the loss function).
  • The regularization term is expressed as a sum of squares.

$\hat{\beta}$ (optimum) value → pulled toward (0,0)
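
The same sketch with Ridge, for comparison with the Lasso example above (same synthetic data-set, alpha again standing in for $\lambda$):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0)   # alpha corresponds to lambda above
ridge.fit(X, y)

# The L2 penalty shrinks coefficients toward 0 but rarely makes them exactly 0,
# which is why Lasso (L1) yields sparser parameters than Ridge (L2).
print("non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
```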
