PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent
Weight Prediction
- URL: http://arxiv.org/abs/2312.00839v2
- Date: Tue, 5 Dec 2023 07:16:55 GMT
- Title: PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent
Weight Prediction
- Authors: Lei Guan, Dongsheng Li, Jiye Liang, Wenjian Wang, Xicheng Lu
- Abstract summary: Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead.
However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches.
We propose an optimizer-dependent weight prediction strategy (a.k.a. PipeOptim) for asynchronous pipeline training.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Asynchronous pipeline model parallelism with a "1F1B" (one forward,
one backward) schedule generates little bubble overhead and consistently
delivers high throughput. However, the "1F1B" schedule inevitably leads to
weight inconsistency and weight staleness issues due to the cross-training of
different mini-batches across GPUs. To address both problems simultaneously,
in this paper, we propose an optimizer-dependent weight prediction strategy
(a.k.a. PipeOptim) for asynchronous pipeline training. The key insight of our
proposal is that we employ a weight prediction strategy in the forward pass to
ensure that each mini-batch uses consistent and staleness-free weights to
compute its forward pass. Concretely, we first construct the weight prediction
scheme based on the update rule of the optimizer used to train the deep neural
network. Then, throughout the "1F1B" pipelined training, each mini-batch
performs weight prediction ahead of its forward pass and uses the predicted
weights to compute that forward pass. As a result, PipeOptim 1) inherits the
advantage of the "1F1B" schedule and delivers high throughput, and 2) ensures
effective parameter learning regardless of which optimizer is used. To verify
the effectiveness of our proposal, we conducted extensive experimental
evaluations with eight deep-learning models spanning three machine-learning
tasks: image classification, sentiment analysis, and machine translation. The
experimental results demonstrate that PipeOptim outperforms popular pipelined
approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code
of PipeOptim is available at
https://github.com/guanleics/PipeOptim.
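The core mechanism described in the abstract, predicting staleness-free weights from the optimizer's own update rule before each forward pass, can be illustrated with a small pure-Python sketch. This is a hypothetical illustration assuming plain SGD with momentum and a simple prediction rule `w_hat = w - lr * s * v` (with `s` the number of pending updates); the paper's actual scheme adapts the prediction to whichever optimizer is in use, so the function names and the specific rule here are assumptions for illustration, not the paper's implementation.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """One SGD-with-momentum update: v <- mu*v + grad, w <- w - lr*v."""
    v = mu * v + grad
    return w - lr * v, v

def predict_weight(w, v, lr, staleness):
    """Predict the weight `staleness` updates ahead by assuming the
    momentum-smoothed gradient v stays roughly constant over those steps:
    w_hat = w - lr * staleness * v."""
    return w - lr * staleness * v

if __name__ == "__main__":
    # A pipeline stage snapshots (w, v) after one update; in a 1F1B schedule,
    # several more mini-batch updates land before this stage's forward pass,
    # so running on the snapshot means running on stale weights.
    w, v = 1.0, 0.0
    w, v = sgd_momentum_step(w, v, grad=0.5)   # update 1
    w_stale, v_stale = w, v                    # stale snapshot the stage holds
    for _ in range(3):                         # 3 pending updates elsewhere
        w, v = sgd_momentum_step(w, v, grad=0.5)
    w_hat = predict_weight(w_stale, v_stale, lr=0.01, staleness=3)
    # The predicted weight lands closer to the true weight than the stale one.
    print(abs(w_hat - w) < abs(w_stale - w))  # → True
```

In real "1F1B" training, `staleness` corresponds to the weight version gap of a stage, i.e. the number of mini-batch updates applied between the stage's forward and backward passes, which grows with the stage's distance from the end of the pipeline.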
Related papers
- Forecast-PEFT: Parameter-Efficient Fine-Tuning for Pre-trained Motion Forecasting Models [68.23649978697027]
Forecast-PEFT is a fine-tuning strategy that freezes the majority of the model's parameters, focusing adjustments on newly introduced prompts and adapters.
Our experiments show that Forecast-PEFT outperforms traditional full fine-tuning methods in motion prediction tasks.
Forecast-FT further improves prediction performance, evidencing up to a 9.6% enhancement over conventional baseline methods.
arXiv Detail & Related papers (2024-07-28T19:18:59Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Zero Bubble Pipeline Parallelism [6.7021820542657045]
Experimental evaluations show that our method outperforms the 1F1B schedule up to 23% in throughput under a similar memory limit.
We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.
arXiv Detail & Related papers (2023-11-30T10:40:34Z) - SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - Learning to Re-weight Examples with Optimal Transport for Imbalanced
Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z) - Sample-Efficient Optimisation with Probabilistic Transformer Surrogates [66.98962321504085]
This paper investigates the feasibility of employing state-of-the-art probabilistic transformers in Bayesian optimisation.
We observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation.
We introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance.
arXiv Detail & Related papers (2022-05-27T11:13:17Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language
Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - Pipelined Backpropagation at Scale: Training Large Models without
Batches [0.9580895202050946]
We evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training algorithm.
We show that appropriate normalization and small batch sizes can also aid training.
arXiv Detail & Related papers (2020-03-25T22:26:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.