Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization
- URL: http://arxiv.org/abs/2410.14548v1
- Date: Fri, 18 Oct 2024 15:43:34 GMT
- Title: Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization
- Authors: Ravil Mussabayev, Rustam Mussabayev
- Abstract summary: K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets.
We propose a novel algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data.
- Score: 0.3069335774032178
- Abstract: K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm significantly enhances the accuracy and efficiency of K-means clustering in big data environments, becoming the new state of the art in the field.
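A minimal Python sketch of the idea described above (sample-restricted MSSC landscapes, expanding neighborhoods, reinitialization of degenerate centroids). This is an illustration only, not the authors' implementation; the function names and the parameters `sample_size`, `p_max`, and `n_iters` are assumptions:
```python
# Illustrative sketch of the sampling + neighborhood-expansion idea from
# the abstract. NOT the authors' implementation; `sample_size`, `p_max`,
# and `n_iters` are assumed hyperparameters.
import numpy as np
from sklearn.cluster import KMeans

def mssc_objective(X, centers):
    """Minimum Sum-of-Squares Clustering objective of `centers` on X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def vns_big_kmeans(X, k, n_iters=20, sample_size=10_000, p_max=5, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # incumbent solution
    best_obj = mssc_objective(X, centers)
    p = 1  # neighborhood size: how many extra centroids to reinitialize
    for _ in range(n_iters):
        # Restrict the MSSC landscape to a random sample of the big data.
        sample = X[rng.choice(len(X), min(sample_size, len(X)), replace=False)]
        labels = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
        # Reinitialize all degenerate centroids (empty clusters on the
        # sample) plus p additional randomly chosen ones.
        degenerate = np.setdiff1d(np.arange(k), np.unique(labels))
        to_reset = np.union1d(degenerate, rng.choice(k, min(p, k), replace=False))
        trial = centers.copy()
        trial[to_reset] = sample[rng.choice(len(sample), len(to_reset), replace=False)]
        # Local search (Lloyd's algorithm) on the sample landscape.
        trial = KMeans(n_clusters=k, init=trial, n_init=1).fit(sample).cluster_centers_
        obj = mssc_objective(X, trial)
        if obj < best_obj:   # improved: accept and shrink the neighborhood
            centers, best_obj, p = trial, obj, 1
        else:                # no improvement: expand the neighborhood
            p = min(p + 1, p_max)
    return centers, best_obj
```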
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
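For context, the conventional two-step pipeline that such unified frameworks improve upon (graph/manifold embedding followed by K-means) can be sketched with scikit-learn; the paper's joint self-supervised objective is not reproduced here:
```python
# Conventional two-step baseline: spectral (graph) embedding, then
# K-means on the embedding. The paper's joint self-supervised objective
# is not reproduced here; the dataset and dimensions are illustrative.
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)
Z = SpectralEmbedding(n_components=10, n_neighbors=15).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(Z)
```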
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
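A hedged sketch of the Gromov-Wasserstein matching at the core of this framework, using the POT library: a coupling between n input points and a much smaller set of prototypes doubles as a soft cluster assignment. The prototype initialization and all sizes are illustrative assumptions, not the paper's solver:
```python
# Gromov-Wasserstein coupling between n input points and m << n
# low-dimensional prototypes; the coupling's row-wise argmax acts as a
# cluster assignment. Prototype initialization and sizes are assumptions.
import numpy as np
import ot  # POT: pip install pot
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # input samples
Y = rng.normal(size=(10, 2))         # low-dimensional prototypes
C1, C2 = cdist(X, X), cdist(Y, Y)    # within-space distance matrices
p = np.full(200, 1 / 200)            # uniform weights on the data
q = np.full(10, 1 / 10)              # uniform weights on the prototypes
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
clusters = T.argmax(axis=1)          # soft coupling -> hard assignment
```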
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review [0.3069335774032178]
K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets.
The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods.
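One representative sampling-based strategy of the kind such reviews cover is mini-batch K-means, sketched here with scikit-learn:
```python
# Mini-batch K-means: centroids are updated from small random batches
# instead of full passes, one of the standard sampling-based remedies.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(100_000, 16))  # toy "big" data
mbk = MiniBatchKMeans(n_clusters=100, batch_size=4096, n_init=3,
                      max_iter=100, random_state=0)
labels = mbk.fit_predict(X)
```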
arXiv Detail & Related papers (2023-10-15T12:35:27Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
The second strategy leverages a novel Re-Ranking technique, which has a lower upper bound on time complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n.
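The memory claim follows from storing only each point's k nearest neighbors rather than the full pairwise matrix; a generic sketch of that pattern (not the paper's re-ranking itself):
```python
# Storing only each point's k nearest neighbors keeps memory at O(kn)
# instead of the O(n^2) full pairwise matrix. Generic sketch, not the
# paper's re-ranking technique.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(5_000, 64))  # e.g. features
k = 20  # k << n
nn = NearestNeighbors(n_neighbors=k).fit(X)
dist, idx = nn.kneighbors(X)  # both of shape (n, k): O(kn) memory
```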
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
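The summary does not give the exact construction; one plausible reading (an assumption) is that pairwise distances are passed through the magnitude gain of an order-n Butterworth low-pass filter:
```python
# Assumed reading: pairwise distances passed through the magnitude gain
# of an order-n Butterworth low-pass filter with cutoff d0. The abstract
# omits the exact formula, so treat this as a guess at the construction.
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.default_rng(0).normal(size=(300, 8))
D = cdist(X, X)
d0, order = np.median(D), 4  # cutoff and filter order (illustrative)
S = 1.0 / np.sqrt(1.0 + (D / d0) ** (2 * order))  # Butterworth gain
# S is close to 1 for nearby points and rolls off smoothly beyond d0.
```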
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy Granular balls [67.33923111887933]
In this paper, the data is fuzzily iterated using granular-balls, and the membership degree of each data point only considers the two granular-balls in which it is located.
The resulting set of fuzzy granular-balls supports a wider range of processing methods across different data scenarios.
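A hedged sketch of the two-nearest-granular-balls membership rule, approximating granular-balls by small K-means clusters (the paper's actual granular-ball construction differs):
```python
# Approximate granular-balls by small K-means clusters, then give each
# point fuzzy memberships over only its two nearest ball centers.
# Illustrative only; the paper's granular-ball construction differs.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))
balls = KMeans(n_clusters=25, n_init=5).fit(X).cluster_centers_
d = np.linalg.norm(X[:, None] - balls[None], axis=2)  # (n, n_balls)
two = np.argsort(d, axis=1)[:, :2]                    # two nearest balls
d2 = np.take_along_axis(d, two, axis=1)
u = 1.0 / (d2 + 1e-12)
u /= u.sum(axis=1, keepdims=True)  # memberships over the two balls only
```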
arXiv Detail & Related papers (2023-03-07T01:52:55Z)
- A Global Optimization Algorithm for K-Center Clustering of One Billion Samples [3.4998703934432682]
This paper presents a practical global optimization algorithm for the K-center clustering problem.
It aims to select K samples as the cluster centers to minimize the maximum within-cluster distance.
Our algorithm reduces the objective function by 25.8% on average across all the synthetic and real-world datasets.
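For reference, the K-center objective admits the classic greedy 2-approximation of Gonzalez (1985), a common baseline that global methods like this one improve upon:
```python
# Classic greedy 2-approximation for K-center (Gonzalez, 1985):
# repeatedly pick the point farthest from the current centers.
import numpy as np

def gonzalez_k_center(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]      # start from a random point
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())                  # farthest point so far
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers], d.max()                 # centers, k-center radius
```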
arXiv Detail & Related papers (2022-12-30T21:53:08Z)
- How to Use K-means for Big Data Clustering? [2.1165011830664677]
K-means is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model.
We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering.
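The K-means++ seeding that such schemes distribute can be sketched as follows (the paper's parallelization itself is not shown; `kmeans_pp_init` is an illustrative name):
```python
# K-means++ seeding: each new center is drawn with probability
# proportional to the squared distance to the nearest center chosen so
# far. The paper's parallel scheme itself is not shown here.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```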
arXiv Detail & Related papers (2022-04-14T08:18:01Z)
- A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our method is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters.
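The general sample-then-assign pattern (not the paper's specific estimator) avoids all-to-all comparisons:
```python
# Sample-then-assign: fit centers on a small random sample, then make a
# single nearest-center pass over the full data. Generic sketch, not the
# paper's specific estimator.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 10))
sample = X[rng.choice(len(X), 5_000, replace=False)]
km = KMeans(n_clusters=50, n_init=5).fit(sample)  # cost tied to sample size
labels = km.predict(X)  # one pass; no all-to-all comparisons
```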
arXiv Detail & Related papers (2021-12-29T19:15:20Z)
- Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning [61.488646649045215]
Federated learning (FL) is a promising strategy for performing privacy-preserving, distributed learning with a network of clients (i.e., edge devices).
arXiv Detail & Related papers (2021-11-28T19:03:39Z)
- Too Much Information Kills Information: A Clustering Perspective [6.375668163098171]
We propose a simple but novel approach for variance-based k-clustering tasks, including the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only.
With certain assumptions, the resulting clustering is provably good to estimate the optimum of the variance-based objective with high probability.
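A quick numeric illustration of that concentration intuition (not the paper's analysis; all sizes are arbitrary):
```python
# The k-means cost estimated on a random subset should be close to the
# cost on the full data. Numeric check of the intuition only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
centers = KMeans(n_clusters=10, n_init=3).fit(X[:5_000]).cluster_centers_

def avg_cost(A):
    """Mean squared distance to the nearest center."""
    return ((A[:, None] - centers[None]) ** 2).sum(axis=2).min(axis=1).mean()

full = avg_cost(X)
sub = avg_cost(X[rng.choice(len(X), 2_000, replace=False)])
print(f"full: {full:.3f}  subset estimate: {sub:.3f}")  # should be close
```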
arXiv Detail & Related papers (2020-09-16T01:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.