High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data
- URL: http://arxiv.org/abs/2311.04517v5
- Date: Tue, 25 Jun 2024 10:49:06 GMT
- Title: High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data
- Authors: Ravil Mussabayev, Rustam Mussabayev
- Abstract summary: This paper introduces a novel formulation of the clustering problem, namely the Minimum Sum-of-Squares Clustering of Infinitely Tall Data (MSSC-ITD).
By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability.
- Score: 0.3069335774032178
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces a novel formulation of the clustering problem, namely the Minimum Sum-of-Squares Clustering of Infinitely Tall Data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging the multi-strategy competitive-cooperative parallelism and intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel in nature, improving solution quality through increased scalability and parallelism, and outperforming even advanced algorithms designed for small and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods by offering better performance in the key metrics. These results also show that parallel processing not only enhances the clustering efficiency, but the accuracy as well. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on synthetic data further confirm HPClust's exceptional scalability and robustness to noise.
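The MSSC objective minimizes the total squared Euclidean distance from each point to its nearest of k centroids. As a rough illustration of the competitive-parallel idea described in the abstract (this is not the authors' HPClust implementation; all function names and parameters below are illustrative assumptions), the sketch runs several independently seeded K-means workers on random samples in parallel and keeps the centroids with the best full-data MSSC objective.

```python
# Minimal sketch of a competitive parallel MSSC portfolio (illustrative only):
# several K-means workers fit on random samples in parallel, and the centroids
# achieving the best full-data MSSC objective win.
import numpy as np
from multiprocessing import Pool

def mssc_objective(X, centers):
    """Minimum sum-of-squares: total squared distance to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd_kmeans(X, k, n_iter=30, seed=0):
    """Plain Lloyd iterations from a random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def worker(args):
    """One competing strategy: fit on a random sample, score on the full data."""
    X, k, sample_size, seed = args
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=sample_size, replace=False)]
    centers = lloyd_kmeans(sample, k, seed=seed)
    return mssc_objective(X, centers), centers

if __name__ == "__main__":
    X = np.random.default_rng(42).normal(size=(20_000, 2))
    jobs = [(X, 5, 2_000, seed) for seed in range(8)]  # 8 competing workers
    with Pool(processes=4) as pool:
        best_obj, best_centers = min(pool.map(worker, jobs), key=lambda r: r[0])
    print(f"best full-data MSSC objective: {best_obj:.1f}")
```

Scoring every candidate on the full dataset while training only on samples is what makes the competition meaningful: cheap workers explore, and the full-data objective arbitrates.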
Related papers
- Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means [0.3069335774032178]
This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology.
The proposed method efficiently integrates parallel processing, sampling, and competitive optimization to create a scalable variant designed for big data applications.
arXiv Detail & Related papers (2024-03-27T17:05:03Z) - Sample-Efficient Clustering and Conquer Procedures for Parallel Large-Scale Ranking and Selection [0.0]
In parallel computing environments, correlation-based clustering can achieve an $\mathcal{O}(p)$ sample complexity reduction rate.
In large-scale AI applications such as neural architecture search, a screening-free version of our procedure surprisingly surpasses fully-sequential benchmarks in terms of sample efficiency.
arXiv Detail & Related papers (2024-02-03T15:56:03Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective distributed training technique.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Coverage and Capacity Optimization in STAR-RISs Assisted Networks: A Machine Learning Approach [102.00221938474344]
A novel model is proposed for the coverage and capacity optimization of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) assisted networks.
The core of the approach is a loss-function-based update strategy, which uses a min-norm solver to compute weights for the coverage and capacity losses at each update (a minimal two-loss min-norm sketch appears after this list).
The numerical results demonstrate that the investigated update strategy outperforms fixed-weight multi-objective (MO) algorithms.
arXiv Detail & Related papers (2022-04-13T13:52:22Z) - ES-Based Jacobian Enables Faster Bilevel Optimization [53.675623215542515]
Bilevel optimization (BO) has arisen as a powerful tool for solving many modern machine learning problems.
Existing gradient-based methods require second-order derivative approximations via Jacobian- and/or Hessian-vector computations.
We propose a novel BO algorithm that adopts an Evolution Strategies (ES)-based method to approximate the response Jacobian matrix in the hypergradient of BO (a generic ES gradient estimator is sketched after this list).
arXiv Detail & Related papers (2021-10-13T19:36:50Z) - An Accurate and Efficient Large-scale Regression Method through Best Friend Clustering [10.273838113763192]
We propose a novel and simple data structure capturing the most important information among data samples.
We combine the clustering with regression techniques as a parallel library and utilize a hybrid structure of data and model parallelism to make predictions.
arXiv Detail & Related papers (2021-04-22T01:34:29Z) - Progressive Batching for Efficient Non-linear Least Squares [31.082253632197023]
Most improvements of the basic Gauss-Newton method tackle convergence guarantees or leverage the sparsity of the underlying problem structure for computational speedup.
Our work borrows ideas from both machine learning and statistics, and we present an approach for non-linear least squares that guarantees convergence while significantly reducing the required amount of computation.
arXiv Detail & Related papers (2020-10-21T13:00:04Z) - Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of the linear combination of both the first-order and high-order base Laplacian matrices.
arXiv Detail & Related papers (2020-08-31T12:28:40Z) - Simple and Scalable Parallelized Bayesian Optimization [2.512827436728378]
We propose a simple and scalable BO method for asynchronous parallel settings.
Experiments are carried out with a benchmark function and hyperparameter optimization of multi-layer perceptrons.
arXiv Detail & Related papers (2020-06-24T10:25:27Z) - Parallelization Techniques for Verifying Neural Networks [52.917845265248744]
We introduce an algorithm that iteratively partitions the verification problem, and we explore two partitioning strategies.
We also introduce a highly parallelizable pre-processing algorithm that uses the neuron activation phases to simplify the neural network verification problems.
arXiv Detail & Related papers (2020-04-17T20:21:47Z)
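For the STAR-RIS entry above, the two-loss min-norm weighting has a simple closed form: given gradients g1 and g2 of the coverage and capacity losses, the minimum-norm convex combination uses alpha = clip(((g2 - g1) . g2) / ||g1 - g2||^2, 0, 1). The sketch below is a minimal illustration of that formula under assumed variable names, not the paper's actual solver.

```python
import numpy as np

def min_norm_weights(g1: np.ndarray, g2: np.ndarray) -> tuple[float, float]:
    """Weights (alpha, 1 - alpha) minimizing ||alpha*g1 + (1 - alpha)*g2||^2.

    Closed form for the two-task case; clipping keeps alpha in [0, 1].
    """
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: any convex weighting is optimal
        return 0.5, 0.5
    alpha = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return alpha, 1.0 - alpha

# Example: combine gradients of two losses into one descent direction.
g_cov, g_cap = np.array([1.0, 0.0]), np.array([0.0, 2.0])
a, b = min_norm_weights(g_cov, g_cap)
print(a, b)  # alpha weights the first gradient, 1 - alpha the second
```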
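For the ES-based Jacobian entry, a standard building block is the antithetic evolution-strategies gradient estimator, which approximates a gradient (or a Jacobian-vector product) from function evaluations alone. The sketch below shows that generic estimator with illustrative parameter names; it is not the paper's exact response-Jacobian scheme.

```python
import numpy as np

def es_gradient(f, x, sigma=0.1, n_pairs=64, seed=0):
    """Antithetic ES estimate of grad f(x) using only function evaluations.

    Averages (f(x + sigma*eps) - f(x - sigma*eps)) / (2*sigma) * eps over
    Gaussian perturbations eps; no analytic derivatives are required.
    """
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_pairs):
        eps = rng.normal(size=x.shape)
        grad += (f(x + sigma * eps) - f(x - sigma * eps)) / (2.0 * sigma) * eps
    return grad / n_pairs

# Example: the estimate approaches the true gradient 2*x of f(x) = ||x||^2.
x = np.array([1.0, -2.0, 0.5])
print(es_gradient(lambda v: float(v @ v), x))  # roughly [2.0, -4.0, 1.0]
```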