Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
- URL: http://arxiv.org/abs/2510.23346v1
- Date: Mon, 27 Oct 2025 14:01:29 GMT
- Title: Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
- Authors: Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner,
- Abstract summary: Block-diagonal LoRA allows for an alternative way of sharding LoRA adapters. We demonstrate in extensive experiments that our block-diagonal LoRA approach is similarly parameter efficient as standard LoRA. For example, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
- Score: 10.097889959657277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged with the base model's weights as the adapter swapping would create overhead and requests using different adapters could not be batched. Rather, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that is well aligned with the base model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy encounters some communication overhead, which may be small in theory, but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that does not require any additional communication for the LoRA computations. We demonstrate in extensive experiments that our block-diagonal LoRA approach is similarly parameter efficient as standard LoRA (i.e., for a similar number of parameters it achieves similar downstream performance) and that it leads to significant end-to-end speed-up over S-LoRA. For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
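To make the sharding idea concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper). It assumes one row-parallel linear layer, a tensor-parallel degree of two, and that the LoRA down-projection is the factor constrained to be block-diagonal, with blocks aligned to the input shards; the paper's actual layer choices and sharding layout may differ.

```python
# Toy NumPy simulation (illustrative sketch only, not the paper's code).
# Assumed setting: one "row-parallel" linear layer y = W x whose input dimension
# is sharded across tp devices, so the base layer already needs one all-reduce
# over the partial outputs. The LoRA down-projection A is constrained to be
# block-diagonal, with block i mapping input shard i to rank shard i; the
# up-projection B is split along the rank dimension. Under these assumptions
# the LoRA partials fold into the base layer's existing all-reduce.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, tp = 16, 12, 4, 2            # toy sizes; tp = tensor-parallel degree
assert d_in % tp == 0 and r % tp == 0
din_s, r_s = d_in // tp, r // tp             # per-shard input and rank sizes

W = rng.normal(size=(d_out, d_in))           # frozen base weight (input dim sharded)
B = rng.normal(size=(d_out, r))              # LoRA up-projection (rank dim sharded)
A_blocks = [rng.normal(size=(r_s, din_s)) for _ in range(tp)]

# Assemble the full block-diagonal down-projection A for the unsharded reference.
A = np.zeros((r, d_in))
for i, Ai in enumerate(A_blocks):
    A[i * r_s:(i + 1) * r_s, i * din_s:(i + 1) * din_s] = Ai

x = rng.normal(size=(d_in,))
y_ref = W @ x + B @ (A @ x)                  # reference: y = (W + B A) x on one device

# Sharded execution: "device" i only sees its input shard and its weight shards.
partials = []
for i in range(tp):
    x_i = x[i * din_s:(i + 1) * din_s]
    W_i = W[:, i * din_s:(i + 1) * din_s]
    B_i = B[:, i * r_s:(i + 1) * r_s]
    u_i = A_blocks[i] @ x_i                  # rank-shard activation, computed locally
    partials.append(W_i @ x_i + B_i @ u_i)   # LoRA partial rides along with the base partial

y_tp = np.sum(partials, axis=0)              # the one all-reduce the base layer needs anyway
print(np.allclose(y_ref, y_tp))              # True: no extra collective for the LoRA path
```

If A were dense instead of block-diagonal, each shard would only hold a partial rank-r vector and the shards would have to exchange it before applying B; that contrast is sketched after the related-papers list below.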
Related papers
- Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update [50.36542772932594]
Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights. There is still a gap between full training with low-rank projections (SVDLoRA) and LoRA fine-tuning, indicating that LoRA steps can be further improved.
arXiv Detail & Related papers (2025-09-24T10:32:50Z) - Cross-LoRA: A Data-Free LoRA Transfer Framework across Heterogeneous LLMs [10.218401136555064]
Cross-LoRA is a framework for transferring LoRA modules between diverse base models. Experiments show that Cross-LoRA achieves relative gains of up to 5.26% over base models.
arXiv Detail & Related papers (2025-08-07T10:21:08Z) - Kron-LoRA: Hybrid Kronecker-LoRA Adapters for Scalable, Sustainable Fine-tuning [0.8761302078860441]
We introduce Kron-LoRA, a hybrid adapter that combines Kronecker-structured factorization with low-rank LoRA compression. Experiments on DistilBERT, Mistral-7B, LLaMA-2-7B, and LLaMA-3-8B show that Kron-LoRA matches or exceeds LoRA baselines with modest memory savings and only a 5-8% speed overhead.
arXiv Detail & Related papers (2025-08-04T00:02:15Z) - Activated LoRA: Fine-tuned LLMs for Intrinsics [6.057520371260868]
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models. We propose Activated LoRA, an adapter architecture which modifies the LoRA framework to only adapt weights for the tokens in the sequence after the aLoRA is invoked.
arXiv Detail & Related papers (2025-04-16T18:03:21Z) - LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z) - Retrieval-Augmented Mixture of LoRA Experts for Uploadable Machine Learning [57.36978335727009]
Low-Rank Adaptation (LoRA) offers an efficient way to fine-tune large language models (LLMs).
In this paper, we propose a framework that adaptively retrieves and composes multiple LoRAs based on input prompts.
arXiv Detail & Related papers (2024-06-24T05:24:41Z) - LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters [11.23006032094776]
We introduce LoRA-XS, a novel fine-tuning method backed by a theoretical derivation. LoRA-XS drastically reduces trainable parameters by incorporating a small, trainable weight matrix. It can scale from a single parameter per module to arbitrarily large values, adapting to any storage or computational constraint.
arXiv Detail & Related papers (2024-05-27T19:07:13Z) - LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report [3.304521604464247]
Low-Rank Adaptation (LoRA) has emerged as one of the most widely adopted methods for Parameter-Efficient Fine-Tuning (PEFT) of Large Language Models (LLMs).
We aim to assess the viability of training and serving LLMs fine-tuned with LoRA in real-world applications.
arXiv Detail & Related papers (2024-04-29T04:01:45Z) - LoraRetriever: Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild [76.67343971195267]
Low-Rank Adaptation (LoRA) provides an efficient solution for fine-tuning large language models (LLMs).
LoraRetriever is a retrieve-then-compose framework that adaptively retrieves and composes multiple LoRAs according to the input prompts.
Experimental results indicate that LoraRetriever consistently outperforms the baselines.
arXiv Detail & Related papers (2024-02-15T15:02:46Z) - S-LoRA: Serving Thousands of Concurrent LoRA Adapters [59.490751234925206]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks.
We present S-LoRA, a system designed for the scalable serving of many LoRA adapters; an illustrative sketch of the shard-level communication that this style of LoRA sharding can entail appears after this list.
arXiv Detail & Related papers (2023-11-06T17:26:17Z) - CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices [78.16679232748196]
We introduce a Compression-Aware LoRA (CA-LoRA) framework to transfer Large Language Models (LLMs) to other tasks.
Experiment results demonstrate that CA-LoRA outperforms the vanilla LoRA methods applied to a compressed LLM.
The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.
arXiv Detail & Related papers (2023-07-15T04:37:11Z)
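For contrast with the block-diagonal sketch above, the following toy example (again NumPy, and again an illustration under assumed shapes rather than S-LoRA's actual kernel layout) shows why an unconstrained LoRA down-projection on a row-parallel layer forces the shards to exchange the small rank-r intermediate before the up-projection can be applied; this extra collective is the communication overhead discussed in the abstract.

```python
# Toy NumPy contrast (an illustration under assumed shapes, not S-LoRA's actual
# kernels): with a dense, unconstrained down-projection A on the same
# row-parallel layer, each shard only produces a partial rank-r intermediate,
# and those partials must be summed across shards (an extra small all-reduce)
# before the up-projection B can be applied.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, tp = 16, 12, 4, 2
shard = d_in // tp

W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))               # dense LoRA down-projection
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(d_in,))

y_ref = W @ x + B @ (A @ x)

u_partials = [A[:, i * shard:(i + 1) * shard] @ x[i * shard:(i + 1) * shard] for i in range(tp)]
u = np.sum(u_partials, axis=0)               # extra collective over the rank-r intermediate
base_partials = [W[:, i * shard:(i + 1) * shard] @ x[i * shard:(i + 1) * shard] for i in range(tp)]
y_tp = np.sum(base_partials, axis=0) + B @ u # base all-reduce happens regardless

print(np.allclose(y_ref, y_tp))              # True, but the LoRA path needed its own collective
```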