Automatic Cross-Replica Sharding of Weight Update in Data-Parallel
Training
- URL: http://arxiv.org/abs/2004.13336v1
- Date: Tue, 28 Apr 2020 07:13:50 GMT
- Title: Automatic Cross-Replica Sharding of Weight Update in Data-Parallel
Training
- Authors: Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake
Hechtman, Shibo Wang
- Abstract summary: This paper presents an approach to automatically shard the weight update across replicas.
We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
- Score: 12.36664837965624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In data-parallel synchronous training of deep neural networks, different
devices (replicas) run the same program with different partitions of the
training batch, but weight update computation is repeated on all replicas,
because the weights do not have a batch dimension to partition. This can be a
bottleneck for performance and scalability in typical language models with
large weights, and models with small per-replica batch size, which is typical in
large-scale training. This paper presents an approach to automatically shard
the weight update computation across replicas with efficient communication
primitives and data formatting, using static analysis and transformations on
the training computation graph. We show this technique achieves substantial
speedups on typical image and language models on Cloud TPUs, requiring no
change to model code. This technique helps close the gap between traditionally
expensive (ADAM) and cheap (SGD) optimizers, as they will only take a small
part of training step time and have similar peak memory usage. It helped us to
achieve state-of-the-art training performance in Google's MLPerf 0.6
submission.
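
The sharded weight update described in the abstract can be illustrated with a small data-parallel sketch. The snippet below is a minimal, hypothetical JAX/pmap example, not the paper's implementation (which is an automatic transformation of the XLA training graph and needs no model-code changes): each replica reduce-scatters the gradients so it holds only its 1/N shard, applies the optimizer update to that shard alone, and all-gathers the updated shards back into full weights. The names `loss_fn` and `train_step`, the plain SGD update, and the single 1-D weight vector are simplifying assumptions.

```python
# Minimal sketch, NOT the paper's XLA graph transformation: a hand-written
# reduce-scatter / shard-update / all-gather pattern for one 1-D weight
# vector with plain SGD. Assumes w.shape[0] is divisible by the device count.
import functools
import jax
import jax.numpy as jnp

N_DEV = jax.local_device_count()
LR = 0.1  # toy learning rate

def loss_fn(w, x):
    # Toy quadratic loss standing in for the real per-replica forward pass.
    return jnp.sum((x @ w) ** 2)

@functools.partial(jax.pmap, axis_name='replicas')
def train_step(w, x):
    grads = jax.grad(loss_fn)(w, x)

    # Reduce-scatter: every replica receives the cross-replica gradient sum,
    # but only for its own contiguous 1/N_DEV shard of the weights.
    grad_shard = jax.lax.psum_scatter(grads, 'replicas', tiled=True)

    # Run the optimizer update only on the matching weight shard; this is the
    # computation that is no longer repeated on every replica.
    shard_size = w.shape[0] // N_DEV
    start = jax.lax.axis_index('replicas') * shard_size
    w_shard = jax.lax.dynamic_slice(w, (start,), (shard_size,))
    new_w_shard = w_shard - LR * grad_shard

    # All-gather the updated shards so every replica again holds identical,
    # full weights for the next forward/backward pass.
    return jax.lax.all_gather(new_w_shard, 'replicas', tiled=True)

# Usage: replicate the weights across devices and shard the batch.
w = jnp.zeros(8 * N_DEV)
w_repl = jnp.broadcast_to(w, (N_DEV,) + w.shape)
x = jax.random.normal(jax.random.PRNGKey(0), (N_DEV, 4, 8 * N_DEV))
w_repl = train_step(w_repl, x)
```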
Related papers
- OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance [35.40320275366383]
Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices.
We rebalanced the computational loads from data, model, and memory perspectives to address this issue.
Our method's efficacy and generalizability were further demonstrated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z) - Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of exchanging gradients.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
arXiv Detail & Related papers (2023-06-16T17:59:51Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - MAP: Memory-aware Automated Intra-op Parallel Training For Foundation
Models [15.256207550970501]
We introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization.
Compared with existing methods, MAP provides an easy-to-use symbolic profiler to generate memory and computing statistics of an arbitrary PyTorch model.
arXiv Detail & Related papers (2023-02-06T07:22:49Z) - Towards Efficient Post-training Quantization of Pre-trained Language
Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise reconstruction error minimization (MREM), an efficient solution that narrows the gap to quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z) - Layered gradient accumulation and modular pipeline parallelism: fast and
efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half.
arXiv Detail & Related papers (2021-06-04T19:21:49Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Stochastic Weight Averaging in Parallel: Large-Batch Training that
Generalizes Well [7.262048441360133]
We propose Stochastic Weight Averaging in Parallel (SWAP) to accelerate DNN training.
Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel.
The resulting models generalize as well as those trained with small mini-batches, but are produced in a substantially shorter time.
arXiv Detail & Related papers (2020-01-07T23:13:35Z)