DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
- URL: http://arxiv.org/abs/2402.17812v2
- Date: Wed, 06 Nov 2024 05:33:16 GMT
- Title: DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
- Authors: Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Se Jung Kwon, Dongsuk Jeon, Dongsoo Lee
- Abstract summary: We propose Dropping Backward Propagation (DropBP) to reduce computational costs and activation memory while maintaining accuracy.
DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules.
It can reduce training time by 44% with accuracy comparable to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a 6.2x longer sequence length on a single NVIDIA A100 GPU.
- Abstract: Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules generated by undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be orthogonally integrated with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a sequence length 6.2x larger on a single NVIDIA A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/dropbp.
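The mechanism can be sketched in a few lines of PyTorch (a minimal illustration under simplifying assumptions, not the authors' released implementation; `DropBPBlock`, `sublayer`, and `drop_prob` are hypothetical names): a dropped layer is evaluated under `torch.no_grad()`, so its forward output is unchanged while its activations are not stored and its backward computation is skipped, and gradients reach earlier layers only through the residual connection.

```python
import torch
import torch.nn as nn


class DropBPBlock(nn.Module):
    """Residual block wrapper: with probability `drop_prob`, the sub-layer's
    backward pass (and its activation storage) is skipped, while the forward
    output stays exactly the same."""

    def __init__(self, sublayer: nn.Module, drop_prob: float = 0.0):
        super().__init__()
        self.sublayer = sublayer
        # In the paper, per-layer drop rates are assigned from a sensitivity
        # estimate; here the rate is just a fixed hyperparameter.
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()).item() < self.drop_prob:
            # Evaluate the branch without recording an autograd graph: no
            # activations are saved and no gradients flow through `sublayer`.
            with torch.no_grad():
                branch = self.sublayer(x)
        else:
            branch = self.sublayer(x)
        # Gradients still flow through the residual (identity) connection.
        return x + branch


# Toy usage: wrap the blocks of a small residual MLP.
blocks = nn.ModuleList(
    [DropBPBlock(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)),
                 drop_prob=0.5) for _ in range(4)]
)
h = torch.randn(2, 64, requires_grad=True)
for blk in blocks:
    h = blk(h)
h.sum().backward()  # dropped blocks receive no parameter gradients this step
```

As the abstract notes, the actual method sets each layer's drop rate from its measured sensitivity rather than using a single fixed probability, which stabilizes training.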
Related papers
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading [13.283682311968752]
TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed.
We show that TBA reduces peak activation memory usage by 47%.
At the same time, TBA fully overlaps the I/O with computation, incurring negligible performance overhead (see the simplified host-memory offloading sketch after this list).
arXiv Detail & Related papers (2024-08-19T14:09:48Z)
- Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning [18.776903525210933]
We introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers).
Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch.
We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources.
arXiv Detail & Related papers (2024-08-16T11:27:52Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Thinking Forward: Memory-Efficient Federated Finetuning of Language Models [21.438831528354513]
Finetuning large language models (LLMs) in federated learning settings requires excessive memory for resource-constrained devices.
In this paper, we introduce Spry, an FL algorithm that splits trainable weights of an LLM among participating clients.
Spry achieves a low memory footprint, high accuracy, and fast convergence.
arXiv Detail & Related papers (2024-05-24T13:37:48Z)
- When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design.
FL fails to work on low-memory AIoT devices due to its heavy memory usage.
We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z)
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
By using locality-sensitive hashing (LSH), we are able to drastically compress latent feature maps without sacrificing much accuracy.
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- DIVISION: Memory Efficient Training via Dual Activation Precision [60.153754740511864]
State-of-the-art work combines a search over quantization bit-widths with training, which makes the procedure complicated and less transparent.
We propose a simple and effective method to compress DNN training.
Experimental results show that DIVISION delivers better overall performance than state-of-the-art methods, including over 10x compression of activation maps and competitive training throughput, without loss of model accuracy.
arXiv Detail & Related papers (2022-08-05T03:15:28Z)
- Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models [42.31924917984774]
We propose a memory-efficient method, named Stochastic Backpropagation (SBP), for training deep neural networks on videos.
Experiments show that SBP can be applied to a wide range of models for video tasks, leading to up to 80.0% GPU memory saving and 10% training speedup with less than 1% accuracy drop on action recognition and temporal action detection.
arXiv Detail & Related papers (2022-03-31T02:24:53Z)
- GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove channels whose importance is smallest.
GDP can be plugged before convolutional layers without bells and whistles, to control the on-and-off of each channel.
Experiments conducted over CIFAR-10 and ImageNet datasets show that the proposed GDP achieves the state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
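For the SSD-based activation offloading summarized in the TBA entry above, the underlying mechanism can be illustrated with PyTorch's built-in saved-tensor hooks. The sketch below uses `torch.autograd.graph.save_on_cpu`, which parks saved activations in host memory instead of on an SSD; it is a simplified stand-in for the idea, not the TBA system itself, and the model and shapes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# Activations saved for backward are packed off to (pinned) host memory as they
# are produced in the forward pass and copied back to the device on demand
# during backward, lowering peak device-side activation memory.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()
```

Systems such as TBA push this pack/unpack pattern further by streaming the saved tensors to SSD and overlapping the transfers with computation, so the offloading adds negligible overhead.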