Learning-Augmented K-Means Clustering Using Dimensional Reduction
- URL: http://arxiv.org/abs/2401.03198v1
- Date: Sat, 6 Jan 2024 12:02:33 GMT
- Title: Learning-Augmented K-Means Clustering Using Dimensional Reduction
- Authors: Issam K.O Jabari, Shofiyah, Pradiptya Kahvi S, Novi Nur Putriwijaya,
and Novanto Yudistira
- Abstract summary: We propose a solution to reduce the dimensionality of the dataset using Principal Component Analysis (PCA).
PCA is well-established in the literature and has become one of the most useful tools for data modeling, compression, and visualization.
- Score: 1.7243216387069678
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Learning augmented is a machine learning concept built to improve the
performance of a method or model, such as enhancing its ability to predict and
generalize data or features, or testing the reliability of the method by
introducing noise and other factors. On the other hand, clustering is a
fundamental aspect of data analysis and has long been used to understand the
structure of large datasets. Despite its long history, the k-means algorithm
still faces challenges. One approach, as suggested by Ergun et al., is to use a
predictor to minimize the sum of squared distances between each data point and
a specified centroid. However, it is known that the computational cost of this
algorithm increases with the value of k, and it often gets stuck in local
minima. In response to these challenges, we propose a solution to reduce the
dimensionality of the dataset using Principal Component Analysis (PCA). It is
worth noting that when using k values of 10 and 25, the proposed algorithm
yields lower cost results compared to running it without PCA. "Principal
component analysis (PCA) is the problem of fitting a low-dimensional affine
subspace to a set of data points in a high-dimensional space. PCA is
well-established in the literature and has become one of the most useful tools
for data modeling, compression, and visualization."
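The pipeline the abstract describes (reduce the data with PCA, then cluster with k-means and compare the sum-of-squared-distances cost) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain Lloyd's k-means rather than the predictor-based method of Ergun et al., the function names `pca_reduce`, `kmeans`, and `kmeans_cost` are hypothetical, the data is synthetic, and measuring both clusterings in the original space is our choice for the comparison.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its top principal components (PCA via SVD)."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumption; not the learning-augmented variant)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def kmeans_cost(X, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    cost = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return float(cost)


# Synthetic data (an assumption): three Gaussian blobs in 20 dimensions.
rng = np.random.default_rng(42)
blob_centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(100, 20)) for c in blob_centers])

# Cluster in the original space and in a 2-dimensional PCA projection.
_, labels_full = kmeans(X, k=3)
_, labels_pca = kmeans(pca_reduce(X, n_components=2), k=3)

# Both clusterings are scored in the original space so the costs are comparable.
print("cost without PCA:", kmeans_cost(X, labels_full, k=3))
print("cost with PCA:   ", kmeans_cost(X, labels_pca, k=3))
```

Which variant yields the lower cost depends on the data, the initialisation, and the number of retained components; the sketch only shows how such a comparison could be set up.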
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Surprisal Driven $k$-NN for Robust and Interpretable Nonparametric
Learning [1.4293924404819704]
We shed new light on the traditional nearest neighbors algorithm from the perspective of information theory.
We propose a robust and interpretable framework for tasks such as classification, regression, density estimation, and anomaly detection using a single model.
Our work showcases the architecture's versatility by achieving state-of-the-art results in classification and anomaly detection.
arXiv Detail & Related papers (2023-11-17T00:35:38Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To better exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Influence of Swarm Intelligence in Data Clustering Mechanisms [0.0]
Nature-inspired, swarm-based algorithms are used for data clustering to cope with larger datasets that contain missing and inconsistent data.
This paper reviews the performance of these new approaches and compares which is best suited to particular problem situations.
arXiv Detail & Related papers (2023-05-07T08:40:50Z)
- How to Use K-means for Big Data Clustering? [2.1165011830664677]
K-means is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model.
We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering.
arXiv Detail & Related papers (2022-04-14T08:18:01Z)
- A Linearly Convergent Algorithm for Distributed Principal Component
Analysis [12.91948651812873]
This paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA).
The proposed algorithm is shown to converge linearly to a neighborhood of the true solution.
arXiv Detail & Related papers (2021-01-05T00:51:14Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Principal Ellipsoid Analysis (PEA): Efficient non-linear dimension
reduction & clustering [9.042239247913642]
This article focuses on improving upon PCA and k-means, by allowing nonlinear relations in the data and more flexible cluster shapes.
The key contribution is a new framework for Principal Ellipsoid Analysis (PEA), defining a simple and computationally efficient alternative to PCA.
In a rich variety of real data clustering applications, PEA is shown to do as well as k-means for simple datasets, while dramatically improving performance in more complex settings.
arXiv Detail & Related papers (2020-08-17T06:25:50Z)
- Learnable Subspace Clustering [76.2352740039615]
We develop a learnable subspace clustering paradigm to efficiently solve the large-scale subspace clustering problem.
The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces.
To the best of our knowledge, this paper is the first work to efficiently cluster millions of data points among the subspace clustering methods.
arXiv Detail & Related papers (2020-04-09T12:53:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.