PaCA: Partial Connection Adaptation for Efficient Fine-Tuning
- URL: http://arxiv.org/abs/2503.01905v2
- Date: Tue, 11 Mar 2025 15:24:13 GMT
- Title: PaCA: Partial Connection Adaptation for Efficient Fine-Tuning
- Authors: Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon
- Abstract summary: We propose PaCA, which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model.
Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios.
- Score: 11.379377511067732
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Prior parameter-efficient fine-tuning (PEFT) algorithms reduce memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters, rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time; although the computational costs of the adapter layers are much smaller than the pretrained layers, it is well known that those two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training, the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model. PaCA not only enhances training speed by eliminating the time overhead due to the sequential processing of the adapter and pretrained layers but also reduces activation memory since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequence and improves throughput by 16% on both NVIDIA A100 GPU and INTEL Gaudi2 HPU compared to LoRA. The code is available at https://github.com/WooSunghyeon/paca.
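As a rough illustration of the mechanism described in the abstract, the sketch below fine-tunes only a random subset of weight columns of a linear layer. It is not the authors' implementation (that is at https://github.com/WooSunghyeon/paca); the class name PaCALinearSketch and the num_partial argument are placeholders.

```python
# Sketch of partial-connection fine-tuning: only `num_partial` randomly chosen
# columns of the pretrained weight receive gradients; there is no extra adapter
# layer in the forward pass.
import torch
import torch.nn as nn


class PaCALinearSketch(nn.Module):
    def __init__(self, pretrained: nn.Linear, num_partial: int):
        super().__init__()
        # Randomly choose which input connections (weight columns) to fine-tune.
        idx = torch.randperm(pretrained.in_features)[:num_partial]
        self.register_buffer("idx", idx)
        # Frozen copy of the full pretrained weight and bias.
        self.register_buffer("weight_frozen", pretrained.weight.detach().clone())
        if pretrained.bias is not None:
            self.register_buffer("bias", pretrained.bias.detach().clone())
        else:
            self.bias = None
        # The selected columns are the only trainable parameters.
        self.weight_partial = nn.Parameter(self.weight_frozen[:, idx].clone())

    def forward(self, x):
        # Overwrite the chosen columns with their trainable copies and run a
        # single matmul; a dedicated kernel would additionally store only
        # x[..., idx] for backward, which is where the activation saving comes from.
        w = self.weight_frozen.index_copy(1, self.idx, self.weight_partial)
        return nn.functional.linear(x, w, self.bias)


# Example usage: train only the partial connections of one layer.
layer = PaCALinearSketch(nn.Linear(1024, 1024), num_partial=64)
optimizer = torch.optim.AdamW([layer.weight_partial], lr=1e-4)
```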
Related papers
- APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient optimizers have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
- Skip2-LoRA: A Lightweight On-device DNN Fine-tuning Method for Low-cost Edge Devices [7.219286228148705]
This paper proposes Skip2-LoRA as a lightweight fine-tuning method for deep neural networks.
In our approach, trainable LoRA adapters are inserted between the last layer and every other layer to enhance the network's expressive power.
Our results show that Skip2-LoRA reduces the fine-tuning time by 90.0% on average compared to the counterpart that has the same number of trainable parameters.
arXiv Detail & Related papers (2024-10-28T14:35:12Z)
- SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training [13.283682311968752]
SSDTrain is an adaptive framework that offloads activations from GPU memory to high-capacity SSDs.
It is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed.
Results demonstrate that SSDTrain reduces peak activation memory usage by 47%.
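As a loose illustration of activation offloading (not SSDTrain itself, which streams activations to NVMe SSDs asynchronously and overlaps transfers with compute), PyTorch's saved-tensor hooks can move saved activations off the GPU during the forward pass and fetch them back for backward:

```python
# Toy activation offloading with PyTorch's saved-tensor hooks: tensors saved
# for backward are moved to host memory and fetched back when needed. (This
# naive version also offloads saved weights; a real system would filter those
# out and overlap transfers with compute.)
import torch
import torch.nn as nn

def pack_to_cpu(t: torch.Tensor):
    return t.to("cpu")          # called when autograd saves a tensor

def unpack_from_cpu(t: torch.Tensor):
    return t.to("cuda")         # called when backward needs it again

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda")

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).square().mean()   # activations saved here are offloaded
loss.backward()                       # and brought back during backward
```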
arXiv Detail & Related papers (2024-08-19T14:09:48Z)
- Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning [18.776903525210933]
We introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers).
Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch.
We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources.
arXiv Detail & Related papers (2024-08-16T11:27:52Z)
- VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
- DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation [13.768426626459558]
We propose Dropping Backward Propagation (DropBP) to reduce computational costs and activation memory while maintaining accuracy.
DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules.
It can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a sequence length 6.2x larger on a single NVIDIA-A100 GPU.
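A toy sketch of the core trick on a single residual block follows; it uses a fixed drop probability, whereas the paper assigns per-layer drop rates based on sensitivity:

```python
# Toy DropBP-style residual block: with probability drop_prob the sublayer is
# evaluated without autograd, so its forward output is unchanged but no
# activations are stored and no gradients flow through it for that step.
import torch
import torch.nn as nn


class ResidualBlockWithDropBP(nn.Module):
    def __init__(self, dim: int, drop_prob: float = 0.5):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training and torch.rand(()) < self.drop_prob:
            with torch.no_grad():    # backward through self.f is skipped
                branch = self.f(x)
            return x + branch        # gradient still flows through the shortcut
        return x + self.f(x)
```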
arXiv Detail & Related papers (2024-02-27T14:51:11Z)
- ConvLoRA and AdaBN based Domain Adaptation via Self-Training [4.006331916849688]
We propose Convolutional Low-Rank Adaptation (ConvLoRA) for multi-target domain adaptation.
ConvLoRA freezes pre-trained model weights, adds trainable low-rank decomposition matrices to convolutional layers, and backpropagates the gradient through these matrices.
Our method has fewer trainable parameters and performs better or on-par with large independent fine-tuned networks.
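One common way to attach a trainable low-rank branch to a frozen convolution is sketched below; the paper's exact parameterization may differ, and rank and scale are illustrative hyperparameters:

```python
# Frozen Conv2d plus a trainable low-rank branch: a k x k conv down to `rank`
# channels followed by a 1x1 conv back up, initialized so the update starts at zero.
import torch.nn as nn


class ConvWithLowRankAdapter(nn.Module):
    def __init__(self, pretrained: nn.Conv2d, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():      # freeze the pretrained kernel
            p.requires_grad_(False)
        self.down = nn.Conv2d(pretrained.in_channels, rank,
                              kernel_size=pretrained.kernel_size,
                              stride=pretrained.stride,
                              padding=pretrained.padding,
                              bias=False)
        self.up = nn.Conv2d(rank, pretrained.out_channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.up.weight)        # low-rank update starts as zero
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```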
arXiv Detail & Related papers (2024-02-07T15:43:50Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs [5.735411578779657]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is commonly used to adapt a base LLM to multiple downstream tasks.
LoRA platforms enable developers to fine-tune multiple models and develop various domain-specific applications simultaneously.
Existing model parallelism schemes suffer from high communication overhead and inefficient GPU utilization when training multiple LoRA tasks.
arXiv Detail & Related papers (2023-12-05T05:38:38Z)
- Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions [9.068569788978854]
Parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable parameters.
PETL methods in computer vision (CV) can nevertheless be computationally expensive and require substantial memory and training time.
E3VA can save up to 62.2% of training memory and 26.2% of training time on average.
arXiv Detail & Related papers (2023-06-16T09:54:07Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for estimating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
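For orientation, the classical column-row sampling (CRS) estimator that this line of work builds on can be sketched as follows; the winner-take-all refinement itself is not reproduced here:

```python
# Classical column-row sampling: estimate A @ B from k sampled column-row
# pairs, chosen with probability proportional to ||A[:, i]|| * ||B[i, :]|| and
# rescaled by 1 / (k * p_i) so the estimate stays unbiased.
import torch

def crs_matmul(A: torch.Tensor, B: torch.Tensor, k: int) -> torch.Tensor:
    probs = A.norm(dim=0) * B.norm(dim=1)
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, k, replacement=True)
    scale = 1.0 / (k * probs[idx])            # one weight per sampled pair
    return (A[:, idx] * scale) @ B[idx, :]

A, B = torch.randn(64, 256), torch.randn(256, 32)
approx = crs_matmul(A, B, k=64)               # unbiased estimate of A @ B
exact = A @ B
```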
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Your representations are in the network: composable and parallel adaptation for large scale models [90.26965623489157]
InCA is a lightweight method for transfer learning that cross-attends to any activation layer of a pre-trained model.
We show that, even when selecting a single top-scoring adapter, InCA achieves performance comparable to full fine-tuning.
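A minimal sketch of a cross-attention adapter over frozen, detached activations is given below; the number of query tokens, the attention width, and the choice of layer to attend to are assumptions, not the paper's design:

```python
# A handful of learned query tokens cross-attend to frozen (detached) token
# features from a pretrained backbone; only the adapter and head are trained.
# feat_dim must be divisible by num_heads.
import torch
import torch.nn as nn


class CrossAttentionAdapter(nn.Module):
    def __init__(self, feat_dim: int, num_queries: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frozen_feats):          # (batch, tokens, feat_dim), detached
        q = self.queries.unsqueeze(0).expand(frozen_feats.shape[0], -1, -1)
        pooled, _ = self.attn(q, frozen_feats, frozen_feats)
        return self.head(pooled.mean(dim=1))
```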
arXiv Detail & Related papers (2023-03-07T18:12:24Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by a more substantial amount than prior PETL methods.
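A minimal sketch of the ladder-side idea follows: a small trainable side path consumes detached intermediate activations from the frozen backbone, so gradients never enter the backbone. Layer sizes and the fusion rule are placeholders, not the paper's exact design:

```python
# Frozen backbone + small trainable side network fed by lateral connections;
# backpropagation never enters the backbone.
import torch
import torch.nn as nn


class LadderSideTuningSketch(nn.Module):
    def __init__(self, backbone_blocks: nn.ModuleList, hidden: int, side_dim: int, num_classes: int):
        super().__init__()
        self.backbone_blocks = backbone_blocks
        for p in self.backbone_blocks.parameters():
            p.requires_grad_(False)
        # One lateral projection and one small side block per backbone block.
        self.laterals = nn.ModuleList(nn.Linear(hidden, side_dim) for _ in backbone_blocks)
        self.side_blocks = nn.ModuleList(nn.Linear(side_dim, side_dim) for _ in backbone_blocks)
        self.head = nn.Linear(side_dim, num_classes)
        self.side_dim = side_dim

    def forward(self, x):                      # x: (batch, hidden)
        side = x.new_zeros(x.shape[0], self.side_dim)
        h = x
        for block, lateral, side_block in zip(self.backbone_blocks, self.laterals, self.side_blocks):
            with torch.no_grad():              # frozen backbone, no activations kept
                h = block(h)
            # Detached backbone features enter the trainable side path.
            side = torch.relu(side_block(side) + lateral(h.detach()))
        return self.head(side)
```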
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)