Related papers: SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity

SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity

URL: http://arxiv.org/abs/2506.16500v1
Date: Thu, 19 Jun 2025 17:53:34 GMT
Title: SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
Authors: Samir Khaki, Xiuyu Li, Junxian Guo, Ligeng Zhu, Chenfeng Xu, Konstantinos N. Plataniotis, Amir Yazdanbakhsh, Kurt Keutzer, Song Han, Zhijian Liu,
Abstract summary: We introduce SparseLoRA, a method that accelerates fine-tuning through contextual sparsity.<n>We show that SparseLoRA reduces computational cost by up to 2.2 times and a measured speedup of up to 1.6 times.
Score: 52.88892280536302
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. Also, we systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to 2.2 times and a measured speedup of up to 1.6 times while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning, code generation, and instruction following.

Related papers

Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights [75.83625828306839]
textbfDrag-and-Drop LLMs (textitDnD) eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates.<n>A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices.
arXiv Detail & Related papers (2025-06-19T15:38:21Z)
Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into sparse LLM sparsity.<n>LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning.<n>LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z)
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM is an efficient multiplication-free model for large language models. It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
arXiv Detail & Related papers (2024-06-10T02:47:55Z)
Sparse Matrix in Large Language Model Fine-tuning [1.9874264019909988]
We introduce a method for selecting sparse sub-matrices that aim to minimize the performance gap between PEFT vs. full fine-tuning.<n>In experiments, we demonstrate that our method consistently surpasses other PEFT baselines.<n>We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases.
arXiv Detail & Related papers (2024-05-24T13:12:14Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers. This paper presents the RunLoRA framework for efficient implementations of LoRA. Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs) Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
Acceleration of Subspace Learning Machine via Particle Swarm Optimization and Parallel Processing [23.33955958124822]
Subspace learning machine (SLM) has been proposed to offer higher performance in general classification and regression tasks. Performance improvement is reached at the expense of higher computational complexity. Experimental results show that the accelerated SLM method achieves a speed up factor of 577 in training time.
arXiv Detail & Related papers (2022-08-15T06:33:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.