Related papers: S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

URL: http://arxiv.org/abs/2412.06289v3
Date: Thu, 19 Dec 2024 18:47:54 GMT
Title: S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity
Authors: Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, Beidi Chen
Abstract summary: We propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". We show that S$^{2}$FT saves training memory up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT.
Score: 39.679861450783605
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current PEFT methods for LLMs can achieve either high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Utilizing this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". It selects a few heads and channels in the MHA and FFN modules for each Transformer block, respectively. Next, it co-permutes weight matrices on both sides of the coupled structures in LLMs to connect the selected components in each layer into a dense submatrix. Finally, S$^{2}$FT performs in-place gradient updates on all submatrices. Through theoretical analysis and empirical results, our method prevents forgetting while simplifying optimization, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements compared to LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, S$^{2}$FT saves training memory up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT, while delivering an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S$^{2}$FT can be decoupled into adapters, enabling effective fusion, fast switch, and efficient parallelism for serving multiple fine-tuned models.
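The "selecting sparsely and computing densely" idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a two-matrix ReLU feed-forward block, and the helper names (`co_permute_ffn`, `ffn`, `matvec`), the sizes, and the random stand-in gradient are all hypothetical. It shows that permuting the intermediate channels of both coupled matrices with the same permutation leaves the block's output unchanged, so the selected channels can be gathered into one contiguous (dense) slice and updated in place. Which matrices the paper actually trains is not stated in the abstract; the down-projection is chosen here only for illustration.

```python
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def ffn(x, w_up, w_down):
    """Toy feed-forward block: w_down @ relu(w_up @ x)."""
    hidden = [max(h, 0.0) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)

def co_permute_ffn(w_up, w_down, selected):
    """Move the selected intermediate channels to the front of both matrices.

    Rows of w_up and columns of w_down are reordered with the same
    permutation, so the function computed by the block is unchanged.
    """
    d_ff = len(w_up)
    chosen = set(selected)
    perm = list(selected) + [j for j in range(d_ff) if j not in chosen]
    w_up_p = [list(w_up[j]) for j in perm]                  # permute rows
    w_down_p = [[row[j] for j in perm] for row in w_down]   # permute columns
    return w_up_p, w_down_p, perm

random.seed(0)
d_model, d_ff = 4, 12  # hypothetical toy sizes
w_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.gauss(0, 1) for _ in range(d_ff)] for _ in range(d_model)]
x = [random.gauss(0, 1) for _ in range(d_model)]

selected = [3, 9, 5]  # hypothetical channel selection
w_up_p, w_down_p, perm = co_permute_ffn(w_up, w_down, selected)

# The permuted block computes the same output as the original one.
y_ref = ffn(x, w_up, w_down)
y_perm = ffn(x, w_up_p, w_down_p)
max_diff = max(abs(a - b) for a, b in zip(y_ref, y_perm))

# The selected channels now form the contiguous slice [:, :k] of w_down_p,
# so an update touches one dense submatrix and leaves the rest frozen.
k = len(selected)
lr = 0.01
for row in w_down_p:
    for j in range(k):
        row[j] -= lr * random.gauss(0, 1)  # random stand-in for a real gradient
```

Because the trainable entries sit in one contiguous slice after the permutation, the update is an ordinary dense operation on a small matrix rather than a scattered sparse write.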

Related papers

Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across languages.
arXiv Detail & Related papers (2025-12-05T11:03:41Z)
Dr.LLM: Dynamic Layer Routing in LLMs [55.11953638340419]
Dr.LLM is a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average.
arXiv Detail & Related papers (2025-10-14T17:51:26Z)
MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation [28.079735905482096]
Low-Rank Adaptation (LoRA) has emerged as a dominant method in.
arXiv Detail & Related papers (2025-10-07T15:06:46Z)
Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning [16.99490636203893]
We present Ravan, an adaptive multi-head LoRA method that balances parameter efficiency and model expressivity. Experiments on vision and language benchmarks show that Ravan improves test accuracy by 2-8% over prior parameter-efficient baselines.
arXiv Detail & Related papers (2025-06-05T20:28:02Z)
GOLLuM: Gaussian Process Optimized LLMs -- Reframing LLM Finetuning through Bayesian Optimization [0.4037357056611557]
Large Language Models (LLMs) can encode complex relationships in their latent spaces. We introduce LLM-based deep kernels, jointly optimized with GPs to preserve the benefits of both. Our method nearly doubles the discovery rate of high-performing reactions compared to static LLM embeddings.
arXiv Detail & Related papers (2025-04-08T17:59:57Z)
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment [20.382810396966473]
Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs). Current methods optimize LoRA by initializing with static singular value decomposition subsets, leading to suboptimal leveraging of pre-trained knowledge. We propose Great LoRA Mixture-of-Expert (GOAT). GOAT integrates relevant priors using an SVD-structured MoE, and aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor.
arXiv Detail & Related papers (2025-02-24T06:48:13Z)
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z)
Low-Rank Interconnected Adaptation across Layers [7.462568595335555]
We propose low-rank interconnected adaptation across layers (Lily). This structure eliminates redundant per-layer $AB$ pairs, enabling higher-rank $\Delta W$ with equal or fewer parameters. Experiments across modalities, architectures, and model sizes demonstrate Lily's superior performance and efficiency.
arXiv Detail & Related papers (2024-07-13T17:03:16Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM is an efficient multiplication-free model for large language models. It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
arXiv Detail & Related papers (2024-06-10T02:47:55Z)
AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models [5.981614673186146]
We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed Adaptive Freezing of Low Rank Adaptation (AFLoRA). Specifically, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Our experimental results demonstrate that we can achieve state-of-the-art performance with an average improvement of up to $0.85\%$ as evaluated on the GLUE benchmark.
arXiv Detail & Related papers (2024-03-20T03:07:50Z)
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE, applied to both the softmax and feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters. We propose SpIEL, a novel sparse finetuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values. We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z)
Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes [53.4856038354195]
Pre-trained large language models (LLMs) need fine-tuning to improve their responsiveness to natural language instructions. FedKSeed employs zeroth-order optimization with a finite set of random seeds. It significantly reduces transmission requirements between the server and clients to just a few random seeds.
arXiv Detail & Related papers (2023-12-11T13:03:21Z)
Generative Parameter-Efficient Fine-Tuning [8.481707805559589]
GIFT learns to generate the fine-tuned weights for a layer directly from its pretrained weights. We show this formulation bridges parameter-efficient fine-tuning and representation fine-tuning.
arXiv Detail & Related papers (2023-12-01T16:33:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.