K-Means

October 5, 2023 1 minute read

데이터를 K개의 군집으로 나누기 위해 각 데이터 포인트를 가장 가까운 군집 중심에 할당하고, 군집 중심을 반복적으로 업데이트하는 군집화 알고리즘

K-Means

Clustering with based on centroid(중심점) of cluster

샘플은 most nearly centrobaric point 를 가진 cluster 로 assignment (할당)

K-means Algorithm 은 사전에 number of cluster 에 대한 hyper-parameter, k 를 define 해야 한다.

EM Algorithm 을 통해, optimized cluster 에 converge(수렴) 할 때 까지 학습

EM Algorithm

Maximum likelihood (최대 가능도) 또는 maximum a posteriori 를 갖는, parameter(모수)의 assumption value 를 찾는 iterative algorithm.

EM algorithm 은 Expectation step 과 Maximization step 으로 구분

Expectation (기대값) step :
- 현재의 추정된 parameter 를 통해 샘플을 cluster 에 assignment 하는 단계
Maximization (최대화) step :
- Likelihood (로그 가능도) 의 기댓값을 maximization 하는 parameter 를 estimation 하는 단계

특정 distribution 에 대한 assumption 이 없는 Non-Parametric estimation 에서는 likelihood 의 개념이 없다.

Mean Shift 나 DBSCAN 의 estimate method of density 로 학습

K-means clustering 에서의 EM Algorithm 은,

Expectation step :
- Estimation 하고자 하는 parameter 는 cetroid 이므로, sample 을 군집으로 assignment 하는 단계
Maximization step :
- Likelihood 를 sample 이 cluster 에 속할 확률로 해석하여, cluster 에 assign 된 샘플을 base 로, 새로운 cetroid point 를 계산

E.g.

K-means Clustering 예제

E.g., K-means Clustering

number of cluster is 2

K = 2; 두 개의 클러스터링

처음 cluster’ centroid 는 set random

Samples 를 most nearly centroid 에 assignment 하여 cluster 를 생성 (Expectation)

Cluster 의 새로운 centroid 를 계산 (Maximization)

Sample 을 most nearly centroid 에 re-assignment 하여 cluster 생성 (Expectation)

Cluster 의 새로운 Centroid 를 계산 (Maximization)

Evaluation Metrics of Clustering

클러스터링 평가 지표

Silhouette coefficient

Most nearby other cluster’s distance 를 통해 계산

Silhouette coefficient have a between -1 to 1 value. nearby 1 is better

Silhouette analysis

clusters 들이 얼마나 efficient 하게 분리되어 있는 지를 보여준다.

Each sample 들이 가지고 있는 Silhouette coefficient 를 기반으로 한다.

전체 Silhouette coefficient’ average 가 클수록, 개별 cluster 의 average’s deviation 이 작을 수록 좋다.

Silhouette coefficient 의 average 가 크고, deviation 이 작도록, K-means algorithm 의 number of cluster, (k) 를 결정한다.

Figure 1

Figure 2

K-means’s uppermost limit

Number of cluster, set first value of centroid 에 따라 performance deviation(편차) 의 차이가 크다.
Cluster 의 size 나 density 가 다를 경우, 학습이 어려울 수 있다.
Data distribution 이 특이할 경우, cluster learning 이 어렵다

K-Means

K-Means

EM Algorithm

E.g.

E.g., K-means Clustering

number of cluster is 2

처음 cluster’ centroid 는 set random

Samples 를 most nearly centroid 에 assignment 하여 cluster 를 생성 (Expectation)

Cluster 의 새로운 centroid 를 계산 (Maximization)

Sample 을 most nearly centroid 에 re-assignment 하여 cluster 생성 (Expectation)

Cluster 의 새로운 Centroid 를 계산 (Maximization)

Evaluation Metrics of Clustering

Silhouette coefficient

Silhouette analysis

K-means’s uppermost limit

Leave a comment

You may also enjoy

Partial Derivative

Derivative

Mathematics GuideMap

Learning Rate