Imbalanced Data Clustering using Equilibrium K-Means

He, Yudong

(Submitted on 22 Feb 2024 (this version), latest version 28 Mar 2024 (v2))

Abstract: Imbalanced data, characterized by an unequal distribution of data points across different clusters, poses a challenge for traditional hard and fuzzy clustering algorithms, such as hard K-means (HKM, or Lloyd's algorithm) and fuzzy K-means (FKM, or Bezdek's algorithm). This paper introduces equilibrium K-means (EKM), a novel and simple K-means-type algorithm that alternates between just two steps, yielding significantly improved clustering results for imbalanced data by reducing the tendency of centroids to crowd together in the center of large clusters. We also present a unifying perspective for HKM, FKM, and EKM, showing they are essentially gradient descent algorithms with an explicit relationship to Newton's method. EKM has the same time and space complexity as FKM but offers a clearer physical meaning for its membership definition. We illustrate the performance of EKM on two synthetic and ten real datasets, comparing it to various clustering algorithms, including HKM, FKM, maximum-entropy fuzzy clustering, two FKM variations designed for imbalanced data, and the Gaussian mixture model. The results demonstrate that EKM performs competitively on balanced data while significantly outperforming other techniques on imbalanced data. For high-dimensional data clustering, we demonstrate that a more discriminative representation can be obtained by mapping high-dimensional data via deep neural networks into a low-dimensional, EKM-friendly space. Deep clustering with EKM improves clustering accuracy by 35% on an imbalanced dataset derived from MNIST compared to deep clustering based on HKM.
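
The abstract's remark that K-means-type algorithms alternate between two steps and can be read as gradient descent can be illustrated with a minimal sketch. It covers only the textbook HKM (Lloyd) and FKM (Bezdek) updates. It is not the paper's EKM, whose membership formula is not given in the abstract. The function names, the toy imbalanced dataset, and the fuzzifier m = 2 are assumptions made for illustration. The final check shows that, for HKM, the centroid update coincides with a gradient step on the sum-of-squared-distances objective with per-cluster step size 1/(2 n_k), which is also the Newton step, since the Hessian with respect to each centroid is 2 n_k times the identity.

```python
import numpy as np

def squared_distances(X, C):
    """Squared Euclidean distances between N points and K centroids, shape (N, K)."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def hkm_memberships(X, C):
    """Hard K-means step 1: one-hot membership to the nearest centroid."""
    d = squared_distances(X, C)
    U = np.zeros_like(d)
    U[np.arange(len(X)), d.argmin(axis=1)] = 1.0
    return U

def fkm_memberships(X, C, m=2.0, eps=1e-12):
    """Fuzzy K-means step 1: u_nk proportional to d_nk^(-1/(m-1)), rows sum to 1."""
    d = squared_distances(X, C) + eps  # eps avoids division by zero
    inv = d ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def update_centroids(X, W, C_old):
    """Step 2: weighted mean of the points; empty clusters keep their old centroid."""
    mass = W.sum(axis=0)
    C_new = C_old.copy()
    nonempty = mass > 0
    C_new[nonempty] = (W.T @ X)[nonempty] / mass[nonempty, None]
    return C_new

# Toy imbalanced data (an assumption for illustration): 2000 points vs. 50 points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(2000, 2)),
    rng.normal(loc=[6.0, 0.0], scale=0.5, size=(50, 2)),
])
C = X[rng.choice(len(X), size=2, replace=False)].copy()

# One HKM iteration: memberships, then centroids.
U = hkm_memberships(X, C)
C_lloyd = update_centroids(X, U, C)

# The same centroid update written as a gradient step on
# J(C) = sum_n min_k ||x_n - c_k||^2, with memberships held fixed.
mass = U.sum(axis=0)
grad = 2.0 * (mass[:, None] * C - U.T @ X)
safe_mass = np.where(mass > 0, mass, 1.0)
C_gd = C - grad / (2.0 * safe_mass[:, None])

# One FKM iteration with fuzzifier m = 2: weights are memberships raised to m.
m = 2.0
U_fuzzy = fkm_memberships(X, C, m=m)
C_fkm = update_centroids(X, U_fuzzy ** m, C)

print("Lloyd update equals gradient step:", np.allclose(C_lloyd, C_gd))
```

Running several such iterations on data like this is one way to observe the behaviour the abstract describes, where centroids tend to drift toward the large cluster. The sketch makes no claim about how EKM counteracts it.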

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2402.14490 [cs.LG]
	(or arXiv:2402.14490v1 [cs.LG] for this version)

Submission history

From: Yudong He [view email]
[v1] Thu, 22 Feb 2024 12:27:38 GMT (3964kb,D)
[v2] Thu, 28 Mar 2024 08:36:27 GMT (3964kb,D)
