Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means
- URL: http://arxiv.org/abs/2403.18766v1
- Date: Wed, 27 Mar 2024 17:05:03 GMT
- Title: Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means
- Authors: Rustam Mussabayev, Ravil Mussabayev
- Abstract summary: This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology.
The proposed method efficiently integrates parallel processing, sampling, and competitive optimization to create a scalable variant designed for big data applications.
- Score: 0.3069335774032178
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to create a scalable variant designed for big data applications. It addresses scalability and computation time challenges typically faced with traditional techniques. The algorithm adjusts sample sizes dynamically for each worker during execution, optimizing performance. Data from these sample sizes are continually analyzed, facilitating the identification of the most efficient configuration. By incorporating a competitive element among workers using different sample sizes, efficiency within the Big-means algorithm is further stimulated. In essence, the algorithm balances computational time and clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.
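To make the mechanism concrete, the following is a minimal Python/NumPy sketch of the idea, not the authors' implementation: workers are simulated sequentially rather than run in parallel, the candidate sample sizes are fixed rather than adapted dynamically, and all names and defaults (`competitive_big_means`, `sample_sizes`, `rounds`) are illustrative.

```python
import numpy as np

# Minimal sketch of the competitive sample-size idea, assuming NumPy.
# Workers are simulated sequentially; names are illustrative only.

def lloyd_step(sample, centers):
    """One Lloyd (k-means) iteration on a data sample."""
    d = np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Keep the old center if a cluster received no points this step.
    return np.array([
        sample[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])

def cost(points, centers):
    """K-means objective: sum of squared distances to nearest centers."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def competitive_big_means(data, k=5, sample_sizes=(200, 500, 1000),
                          rounds=20, inner_iters=5, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    # A fixed evaluation sample stands in for full-data evaluation.
    eval_set = data[rng.choice(len(data), size=min(2000, len(data)),
                               replace=False)]
    for _ in range(rounds):
        candidates = []
        for s in sample_sizes:  # each "worker" owns one candidate sample size
            sample = data[rng.choice(len(data), size=min(s, len(data)),
                                     replace=False)]
            c = centers.copy()
            for _ in range(inner_iters):
                c = lloyd_step(sample, c)
            candidates.append(c)
        # Competitive step: the sample size with the lowest cost on the
        # shared evaluation sample wins and propagates its centers.
        centers = min(candidates, key=lambda c: cost(eval_set, c))
    return centers
```

The `min` over candidates is the competitive element: whichever sample size yields the lowest clustering cost propagates its centers into the next round, so efficient configurations come to dominate the computation.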
Related papers
- Scaling LLM Inference with Optimized Sample Compute Allocation [56.524278187351925]
We propose OSCA, an algorithm to find an optimal mix of different inference configurations.
Our experiments show that with our learned mixed allocation, we can achieve better accuracy than the best single configuration.
OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration.
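As a toy illustration of why a learned mix can beat the best single configuration (this is not the OSCA procedure; the configuration names, success rates, and exhaustive search below are invented for the example), score each allocation of a fixed sample budget by the average chance that at least one sampled answer per task is correct:

```python
from itertools import product

def coverage(success_rates, alloc):
    """P(at least one allocated sample is correct) for one task."""
    p_miss = 1.0
    for p, n in zip(success_rates, alloc):
        p_miss *= (1.0 - p) ** n
    return 1.0 - p_miss

def best_allocation(task_rates, budget):
    """Exhaustively search allocations of `budget` samples over configs."""
    n_cfg = len(task_rates[0])
    best = None
    for alloc in product(range(budget + 1), repeat=n_cfg):
        if sum(alloc) != budget:
            continue
        avg = sum(coverage(r, alloc) for r in task_rates) / len(task_rates)
        if best is None or avg > best[1]:
            best = (alloc, avg)
    return best

# Config 0 solves task A well but not task B; config 1 is the reverse,
# so the optimal shared allocation is mixed, e.g. (2, 2).
tasks = [(0.9, 0.1), (0.1, 0.9)]
print(best_allocation(tasks, budget=4))
```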
arXiv Detail & Related papers (2024-10-29T19:17:55Z)
- Robust Clustering on High-Dimensional Data with Stochastic Quantization [0.0]
This paper addresses the limitations of conventional vector quantization algorithms.
It investigates Stochastic Quantization (SQ) as an alternative for high-dimensional computation.
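For orientation, classic stochastic (online) vector quantization, which this line of work builds on, can be sketched in a few lines; the paper's specific SQ algorithm and its robustness properties are its own:

```python
import numpy as np

# Sketch of stochastic vector quantization: centers are updated one
# random point at a time with a decaying (Robbins-Monro) step size,
# so memory stays O(k * dim) regardless of dataset size.
def stochastic_quantization(data, k, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for t in range(1, steps + 1):
        x = data[rng.integers(len(data))]                # one random point
        j = np.argmin(((centers - x) ** 2).sum(axis=1))  # nearest center
        lr = 1.0 / t ** 0.75                             # decaying step size
        centers[j] += lr * (x - centers[j])              # pull center toward x
    return centers
```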
arXiv Detail & Related papers (2024-09-03T17:13:55Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
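A hedged sketch of that combined objective, assuming a greedy k-center-style loop (the paper's factor-3 algorithm and its exact weighting differ; `alpha` and the scoring rule here are illustrative):

```python
import numpy as np

# Greedy selection in the spirit of a weighted k-center + uncertainty
# objective: each step picks the point maximizing a weighted sum of its
# distance to the chosen subset and its model uncertainty.
def weighted_kcenter_select(X, uncertainty, k, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    # Distance of every point to its nearest chosen point so far.
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        score = alpha * dist + (1 - alpha) * uncertainty
        score[chosen] = -np.inf           # never re-pick a chosen point
        nxt = int(score.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```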
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review [0.3069335774032178]
K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets.
The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods.
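As one concrete example of the sampling methods such reviews cover, mini-batch k-means (Sculley, 2010) touches only a small random batch per iteration, so the per-step cost is independent of the full dataset size; a minimal sketch:

```python
import numpy as np

# Mini-batch k-means: centers move toward batch points with a
# per-center learning rate of 1/count, following Sculley's update rule.
def minibatch_kmeans(data, k, batch_size=256, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    counts = np.zeros(k)                       # per-center update counts
    for _ in range(iters):
        batch = data[rng.choice(len(data), size=min(batch_size, len(data)),
                                replace=False)]
        d = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]   # per-center step
    return centers
```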
arXiv Detail & Related papers (2023-10-15T12:35:27Z)
- Federated Conditional Stochastic Optimization [110.513884892319]
Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and MAML.
This paper proposes algorithms for federated conditional stochastic optimization.
arXiv Detail & Related papers (2023-10-04T01:47:37Z)
- Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores randomized dimension reduction algorithms for fast execution and efficient data utilization.
We focus on learning algorithms with various incorporations of data augmentation that provably improve generalization and distributional robustness.
Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper-bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n.
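The memory claim can be illustrated independently of the re-ranking details (which are the paper's own): storing only each point's k nearest neighbors instead of the full pairwise matrix already yields the O(kn) footprint. A hypothetical sketch:

```python
import numpy as np

# Keep only each point's k nearest neighbors: O(kn) memory instead of
# the O(n^2) full distance matrix (assumes k < n).
def knn_sparse_distances(X, k):
    n = len(X)
    nbr_idx = np.empty((n, k), dtype=int)
    nbr_dist = np.empty((n, k))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        order = np.argpartition(d, k)[:k]  # k nearest, unsorted
        nbr_idx[i] = order
        nbr_dist[i] = d[order]
    return nbr_idx, nbr_dist
```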
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets [86.43523688236077]
Combinatorial optimization (CO) problems are often NP-hard and out of reach for exact algorithms.
GFlowNets have emerged as a powerful machinery to efficiently sample from composite unnormalized densities sequentially.
In this paper, we design Markov decision processes (MDPs) for different problems and propose to train conditional GFlowNets to sample from the solution space.
arXiv Detail & Related papers (2023-05-26T15:13:09Z)
- A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our method is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters.
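The generic sample-then-extend pattern behind such methods can be sketched as follows (an illustration under assumed names, not the paper's exact estimator): cluster a small uniform sample, then assign every point to its nearest sample-derived center, so no all-to-all comparison over the full dataset is ever formed:

```python
import numpy as np

# Cluster only a sample, then extend labels to all points. The full
# pass is O(n * k), not O(n^2); any k-means routine can be plugged in
# as `kmeans_fn(sample, k) -> centers`.
def sample_then_extend(data, k, sample_size, kmeans_fn, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=min(sample_size, len(data)),
                     replace=False)
    centers = kmeans_fn(data[idx], k)
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), centers       # labels for all points
```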
arXiv Detail & Related papers (2021-12-29T19:15:20Z)
- Stochastic batch size for adaptive regularization in deep network optimization [63.68104397173262]
We propose a first-order optimization algorithm incorporating adaptive regularization, applicable to machine learning problems in the deep learning framework.
We empirically demonstrate the effectiveness of our algorithm using an image classification task based on conventional network models applied to commonly used benchmark datasets.
arXiv Detail & Related papers (2020-04-14T07:54:53Z)
- Ensemble Slice Sampling: Parallel, black-box and gradient-free inference for correlated & multimodal distributions [0.0]
Slice Sampling has emerged as a powerful Markov Chain Monte Carlo algorithm that adapts to the characteristics of the target distribution with minimal hand-tuning.
This paper introduces Ensemble Slice Sampling (ESS), a new class of algorithms that bypasses standard slice sampling's sensitivity to the user-specified initial length scale by tuning it adaptively.
These affine-invariant algorithms are trivial to construct, require no hand-tuning, and can easily be implemented in parallel computing environments.
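For orientation, the basic univariate slice sampling step that ESS builds on (Neal, 2003) is sketched below; the ensemble-based, affine-invariant tuning that defines ESS itself is not shown, and `w` plays the role of the initial length scale the paper adapts automatically:

```python
import numpy as np

# One slice sampling update with stepping-out and shrinkage (Neal, 2003).
# `log_f` is the unnormalized log-density, `x` the current state.
def slice_sample_step(log_f, x, w, rng):
    log_y = log_f(x) + np.log(rng.uniform())    # slice height under f
    # Stepping out: grow [l, r] until it brackets the slice.
    l = x - w * rng.uniform()
    r = l + w
    while log_f(l) > log_y:
        l -= w
    while log_f(r) > log_y:
        r += w
    # Shrinkage: sample uniformly, shrinking the interval on rejection.
    while True:
        x_new = rng.uniform(l, r)
        if log_f(x_new) > log_y:
            return x_new
        if x_new < x:
            l = x_new
        else:
            r = x_new

# Example: draw from a standard normal via its unnormalized log-density.
rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(1000):
    x = slice_sample_step(lambda t: -0.5 * t * t, x, w=1.0, rng=rng)
    samples.append(x)
```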
arXiv Detail & Related papers (2020-02-14T19:00:12Z)