Pipelined Backpropagation at Scale: Training Large Models without
Batches
- URL: http://arxiv.org/abs/2003.11666v3
- Date: Sat, 10 Apr 2021 00:50:11 GMT
- Title: Pipelined Backpropagation at Scale: Training Large Models without
Batches
- Authors: Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, Urs Köster
- Abstract summary: We evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training algorithm.
We show that appropriate normalization and small batch sizes can also aid training.
- Score: 0.9580895202050946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: New hardware can substantially increase the speed and efficiency of deep
neural network training. To guide the development of future hardware
architectures, it is pertinent to explore the hardware and machine learning
properties of alternative training algorithms. In this work we evaluate the use
of small batch, fine-grained Pipelined Backpropagation, an asynchronous
pipeline parallel training algorithm that has significant hardware advantages.
We introduce two methods, Spike Compensation and Linear Weight Prediction, that
effectively mitigate the downsides caused by the asynchronicity of Pipelined
Backpropagation and outperform existing techniques in our setting. We show that
appropriate normalization and small batch sizes can also aid training. With our
methods, fine-grained Pipelined Backpropagation using a batch size of one can
match the accuracy of SGD for multiple networks trained on CIFAR-10 and
ImageNet. Simple scaling rules allow the use of existing hyperparameters for
traditional training without additional tuning.
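To make the weight-prediction idea concrete, here is a minimal sketch of linear weight prediction for SGD with momentum under a fixed pipeline delay. The function name, the assumption that the current velocity persists for `delay` steps, and the usage values are illustrative only and do not reproduce the paper's exact formulation (Spike Compensation is not shown).

```python
import torch

def predict_weights(weights, velocities, lr, delay):
    """Linear weight prediction (illustrative sketch).

    Extrapolate the stale weights `delay` optimizer steps into the future by
    assuming the current SGD momentum velocity persists, so a delayed pipeline
    stage can run its forward pass on predicted rather than stale weights.
    """
    return [w - lr * delay * v for w, v in zip(weights, velocities)]

# Hypothetical usage for one pipeline stage, where `delay` is the number of
# optimizer steps that occur before this activation's gradient returns.
w = [torch.randn(10, 10)]        # current (stale) stage weights
v = [torch.zeros_like(w[0])]     # SGD momentum buffers
w_pred = predict_weights(w, v, lr=0.1, delay=4)
```

With zero velocity the prediction reduces to the stale weights themselves, which corresponds to plain Pipelined Backpropagation without mitigation.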
Related papers
- ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation [2.0181279529015925]
ReCycle is a system designed for efficient training in the presence of failures.
It exploits the inherent functional redundancy in distributed training systems.
We show it achieves high training throughput under multiple failures.
arXiv Detail & Related papers (2024-05-22T21:35:56Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
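The adaptation scheme summarized above, a lightweight network trained in parallel on features from a frozen backbone so that no gradients are backpropagated through the backbone, can be sketched roughly as follows; the head architecture, feature dimension, and class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Illustrative sketch: train only a small head on frozen backbone features."""

    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        self.head = nn.Sequential(           # only this part receives gradients
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, x):
        with torch.no_grad():                # no backprop through the backbone
            feats = self.backbone(x)
        return self.head(feats)
```

Only the head's parameters would be handed to the optimizer, which is where the time, memory, and parameter savings relative to full fine-tuning come from.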
- An End-to-End Network Pruning Pipeline with Sparsity Enforcement [0.0]
We develop an end-to-end training pipeline that befits neural network pruning and sparsification at all stages of training.
We conduct experiments utilizing combinations of these methods, in addition to different techniques used in the pruning step.
arXiv Detail & Related papers (2023-12-04T06:11:39Z)
- PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction [37.05698088730229]
Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead.
"1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches.
We propose an-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training.
arXiv Detail & Related papers (2023-12-01T01:52:38Z)
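The optimizer-dependent prediction described above can be illustrated by extrapolating along the update direction the optimizer itself is currently taking. The Adam-style rule below (with bias correction omitted) is an assumed example and not necessarily the exact PipeOptim strategy.

```python
import torch

def adam_weight_prediction(w, exp_avg, exp_avg_sq, lr, delay, eps=1e-8):
    """Illustrative sketch: predict weights `delay` steps ahead using the
    current Adam-like update direction (first moment / sqrt(second moment))."""
    direction = exp_avg / (exp_avg_sq.sqrt() + eps)
    return w - lr * delay * direction

w = torch.randn(10)                 # stale weights of one pipeline stage
m = torch.zeros(10)                 # first-moment estimate (exp_avg)
v = torch.full((10,), 1e-4)         # second-moment estimate (exp_avg_sq)
w_pred = adam_weight_prediction(w, m, v, lr=1e-3, delay=3)
```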
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
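The "squeeze a block into a single convolution" idea above rests on the linearity of convolution, the principle behind structural re-parameterization in general. The sketch below merges two parallel convolutions whose outputs are summed; it shows only this principle and is not OREPA's specific online procedure.

```python
import torch
import torch.nn as nn

def merge_parallel_convs(conv_a, conv_b):
    """Two parallel convolutions with identical shapes whose outputs are summed
    are equivalent to one convolution with summed weights and biases."""
    merged = nn.Conv2d(conv_a.in_channels, conv_a.out_channels,
                       conv_a.kernel_size, padding=conv_a.padding)
    with torch.no_grad():
        merged.weight.copy_(conv_a.weight + conv_b.weight)
        merged.bias.copy_(conv_a.bias + conv_b.bias)
    return merged

a = nn.Conv2d(8, 16, 3, padding=1)
b = nn.Conv2d(8, 16, 3, padding=1)
fused = merge_parallel_convs(a, b)
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(a(x) + b(x), fused(x), atol=1e-5)
```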
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment [26.65651157173834]
We present a photonic accelerator for Direct Feedback Alignment, able to compute random projections with trillions of parameters.
This is a significant step towards building scalable hardware, able to go beyond backpropagation.
arXiv Detail & Related papers (2020-12-11T14:20:45Z)
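Direct Feedback Alignment, the algorithm the photonic co-processor above accelerates, replaces backpropagation's transposed weight matrices with fixed random feedback matrices that project the output error directly to each hidden layer. The NumPy sketch below uses assumed layer sizes, a tanh nonlinearity, and a squared-error loss for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, lr = 20, 64, 5, 0.01

W1, W2, W3 = (rng.normal(0, 0.1, s) for s in [(n_in, n_h), (n_h, n_h), (n_h, n_out)])
B1, B2 = rng.normal(0, 0.1, (n_out, n_h)), rng.normal(0, 0.1, (n_out, n_h))  # fixed random feedback

x = rng.normal(size=(1, n_in))
y = np.eye(n_out)[[0]]                        # one-hot target

# Forward pass
a1 = x @ W1;  h1 = np.tanh(a1)
a2 = h1 @ W2; h2 = np.tanh(a2)
e = h2 @ W3 - y                               # output error

# DFA backward: project e with fixed B1, B2 instead of W3.T and W2.T
d2 = (e @ B2) * (1 - np.tanh(a2) ** 2)
d1 = (e @ B1) * (1 - np.tanh(a1) ** 2)
W3 -= lr * h2.T @ e
W2 -= lr * h1.T @ d2
W1 -= lr * x.T @ d1
```

Because the feedback matrices are fixed random projections, each layer's update depends only on the output error, which is what makes the algorithm attractive for dedicated hardware.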
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
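CLARS belongs to the layer-wise adaptive rate scaling family, in which each layer's update is rescaled by the ratio of its weight norm to its gradient norm. The sketch below shows the generic LARS-style trust ratio (without weight decay or momentum); it is not the specific CLARS algorithm.

```python
import torch

def layerwise_scaled_step(params, grads, base_lr, trust_coef=1e-3, eps=1e-8):
    """Illustrative LARS-style step: scale each layer's learning rate by
    ||w|| / ||g|| so large-batch updates stay proportionate per layer."""
    for w, g in zip(params, grads):
        local_lr = base_lr * trust_coef * w.norm() / (g.norm() + eps)
        w.sub_(local_lr * g)                 # in-place SGD step with layer-wise rate

params = [torch.randn(64, 32), torch.randn(64)]
grads = [torch.randn_like(p) for p in params]
layerwise_scaled_step(params, grads, base_lr=1.0)
```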
- Pipelined Training with Stale Weights of Deep Convolutional Neural Networks [0.1921787217122713]
We explore the impact of stale weights on the statistical efficiency and performance in a pipelined backpropagation scheme.
We show that when pipelining is limited to early layers in a network, training with stale weights converges and results in models with comparable inference accuracies.
We propose combining pipelined and non-pipelined training in a hybrid scheme to address this drop.
arXiv Detail & Related papers (2019-12-29T15:28:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.