Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers
- URL: http://arxiv.org/abs/2110.07029v1
- Date: Wed, 13 Oct 2021 20:58:15 GMT
- Title: Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers
- Authors: Yujing Ma, Florin Rusu, Kesheng Wu, Alexander Sim
- Abstract summary: We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
- Score: 65.60007071024629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by extreme multi-label classification applications, we consider
training deep learning models over sparse data in multi-GPU servers. The
variance in the number of non-zero features across training batches and the
intrinsic GPU heterogeneity combine to limit accuracy and increase the time to
convergence. We address these challenges with Adaptive SGD, an adaptive elastic
model averaging stochastic gradient descent algorithm for heterogeneous
multi-GPUs that is characterized by dynamic scheduling, adaptive batch size
scaling, and normalized model merging. Instead of statically partitioning
batches to GPUs, batches are routed based on the relative processing speed.
Batch size scaling assigns larger batches to the faster GPUs and smaller
batches to the slower ones, with the goal of arriving at a steady state in which
all the GPUs perform the same number of model updates. Normalized model merging
computes optimal weights for every GPU based on the assigned batches such that
the combined model achieves better accuracy. We show experimentally that
Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy
and is scalable with the number of GPUs.
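To make the scheduling ideas concrete, the following is a minimal Python/NumPy sketch of speed-proportional batch size scaling and normalized model merging as described in the abstract; the function names, the proportional-scaling rule, and the work-weighted averaging are illustrative assumptions rather than the paper's exact formulation.
```python
import numpy as np

def scale_batch_sizes(base_batch, speeds):
    """Assign each GPU a batch size proportional to its measured speed,
    so that, at steady state, all GPUs perform a similar number of model
    updates per scheduling round."""
    speeds = np.asarray(speeds, dtype=float)
    sizes = np.round(base_batch * speeds / speeds.mean())
    return np.maximum(1, sizes).astype(int)

def normalized_merge(models, batches_processed):
    """Merge per-GPU model replicas with weights normalized by the number
    of batches each GPU actually processed."""
    weights = np.asarray(batches_processed, dtype=float)
    weights = weights / weights.sum()
    return sum(w * m for w, m in zip(weights, models))

# Example: four heterogeneous GPUs with relative speeds 1.0, 0.8, 1.5, 0.6.
speeds = [1.0, 0.8, 1.5, 0.6]
batch_sizes = scale_batch_sizes(base_batch=256, speeds=speeds)
print(batch_sizes)  # larger batches are routed to the faster GPUs

# Per-GPU model replicas (here just flat parameter vectors) and work counts.
rng = np.random.default_rng(0)
models = [rng.standard_normal(10) for _ in speeds]
merged = normalized_merge(models, batches_processed=[12, 10, 18, 7])
print(merged.shape)  # (10,)
```
In a full system, the merge would presumably be applied to the parameter tensors of each GPU's replica after every round, and the per-GPU batch sizes would be re-estimated from observed throughput rather than from fixed speed constants.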
Related papers
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPUs) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z) - Distributed Extra-gradient with Optimal Complexity and Communication
Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors.
Extra-gradient, which is the de facto algorithm for monotone VI problems, has not been designed to be communication-efficient.
We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
arXiv Detail & Related papers (2023-08-17T21:15:04Z) - DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z) - GPU-Accelerated Primal Learning for Extremely Fast Large-Scale
Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON.
We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)