An Accurate and Efficient Large-scale Regression Method through Best
Friend Clustering
- URL: http://arxiv.org/abs/2104.10819v1
- Date: Thu, 22 Apr 2021 01:34:29 GMT
- Title: An Accurate and Efficient Large-scale Regression Method through Best
Friend Clustering
- Authors: Kun Li, Liang Yuan, Yunquan Zhang, Gongwei Chen
- Abstract summary: We propose a novel and simple data structure capturing the most important information among data samples.
We combine the clustering with regression techniques as a parallel library and utilize a hybrid structure of data and model parallelism to make predictions.
- Score: 10.273838113763192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the data size in machine learning grows exponentially, it becomes
necessary to accelerate the computation by utilizing the ever-growing number of
cores provided by high-performance computing hardware.
However, existing parallel methods for clustering or regression often suffer
from low accuracy, slow convergence, and complex hyperparameter tuning.
Furthermore, parallel efficiency is usually difficult to improve while striking
a balance between preserving model properties and partitioning computing
workloads on distributed systems. In this paper, we propose a novel and simple
data structure that captures the most important information among data samples.
It has several advantageous properties that support a hierarchical clustering
strategy independent of the hardware parallelism, well-defined metrics for
determining the optimal clustering, balanced partitioning that maintains the
compactness property, and efficient parallelization of the computation phases.
Then we combine
the clustering with regression techniques as a parallel library and utilize a
hybrid structure of data and model parallelism to make predictions. Experiments
show that our library achieves remarkable performance in convergence,
accuracy, and scalability.
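
The abstract does not spell out the data structure or the library's API, but the overall recipe can be illustrated with a small, self-contained sketch. The snippet below is a minimal illustration only: it assumes a mutual-nearest-neighbor ("best friend") pairing rule, centroid-based hierarchical merging, and one ridge regressor per cluster; these names and choices are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's library): cluster samples by repeatedly merging
# mutual-nearest-neighbor ("best friend") groups, then fit one regressor per
# cluster and route each query to the nearest cluster centroid.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import pairwise_distances

def best_friend_pairs(points):
    """Return index pairs (i, j) where i and j are each other's nearest neighbor."""
    dist = pairwise_distances(points)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nn) if nn[j] == i and i < j]

def best_friend_clusters(X, n_clusters):
    """Hierarchically merge mutual-nearest-neighbor groups until n_clusters remain."""
    groups = [[i] for i in range(len(X))]
    while len(groups) > n_clusters:
        centroids = np.array([X[g].mean(axis=0) for g in groups])
        pairs = best_friend_pairs(centroids)
        if not pairs:  # safety guard; a mutual pair always exists for >= 2 groups
            break
        pairs = pairs[: len(groups) - n_clusters]  # do not merge past the target
        used = {k for pair in pairs for k in pair}
        groups = [groups[i] + groups[j] for i, j in pairs] + \
                 [g for k, g in enumerate(groups) if k not in used]
    return groups

class ClusteredRegressor:
    """One ridge model per cluster; predictions are routed to the nearest centroid."""
    def fit(self, X, y, n_clusters=8):
        self.groups = best_friend_clusters(X, n_clusters)
        self.centroids = np.array([X[g].mean(axis=0) for g in self.groups])
        self.models = [Ridge().fit(X[g], y[g]) for g in self.groups]
        return self

    def predict(self, X):
        owner = pairwise_distances(X, self.centroids).argmin(axis=1)
        return np.array([self.models[k].predict(x[None, :])[0]
                         for k, x in zip(owner, X)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
    reg = ClusteredRegressor().fit(X, y, n_clusters=8)
    print(reg.predict(X[:3]))
```

In the paper's actual library, the per-cluster models can be trained concurrently (model parallelism) while the samples themselves are partitioned across workers (data parallelism); the serial sketch above only illustrates the clustering and routing logic.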
Related papers
- Sample-Efficient Clustering and Conquer Procedures for Parallel
Large-Scale Ranking and Selection [0.0]
In parallel computing environments, correlation-based clustering can achieve an $\mathcal{O}(p)$ sample complexity reduction rate.
In large-scale AI applications such as neural architecture search, a screening-free version of our procedure surprisingly surpasses fully-sequential benchmarks in terms of sample efficiency.
arXiv Detail & Related papers (2024-02-03T15:56:03Z) - Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z) - High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data [0.3069335774032178]
This paper introduces a novel formulation of the clustering problem, namely the Minimum Sum-of-Squares Clustering of Infinitely Tall Data (MSSC-ITD).
By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability.
arXiv Detail & Related papers (2023-11-08T08:02:52Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Data splitting improves statistical performance in overparametrized
regimes [0.0]
Distributed learning is a common strategy to reduce the overall training time by exploiting multiple computing devices.
We show that in this regime, data splitting has a regularizing effect, hence improving statistical performance and computational complexity.
arXiv Detail & Related papers (2021-10-21T08:10:56Z) - A New Parallel Adaptive Clustering and its Application to Streaming Data [0.0]
This paper presents a parallel adaptive clustering (PAC) algorithm to automatically classify data while simultaneously choosing a suitable number of classes.
We develop regularized set k-means to efficiently cluster the results from the parallel threads.
We provide theoretical analysis and numerical experiments to characterize the performance of the method.
arXiv Detail & Related papers (2021-04-06T17:18:56Z) - Real-Time Regression with Dividing Local Gaussian Processes [62.01822866877782]
Local Gaussian processes are a novel, computationally efficient modeling approach based on Gaussian process regression.
Due to an iterative, data-driven division of the input space, they achieve a sublinear computational complexity in the total number of training points in practice.
A numerical evaluation on real-world data sets shows their advantages over other state-of-the-art methods in terms of accuracy as well as prediction and update speed.
arXiv Detail & Related papers (2020-06-16T18:43:31Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)