Fugu-MT paper translation (abstract): Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving

Paper overview: Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving

arxiv url: http://arxiv.org/abs/2510.23346v1
Date: Mon, 27 Oct 2025 14:01:29 GMT
Status: Translation complete
Last updated in system: 2025-10-28 19:54:32.612142
Title: Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
Title (reference translation): Block-diagonal LoRA for eliminating communication overhead in tensor-parallel LoRA serving
Authors: Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner
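Abstract summary: Block-diagonal LoRA enables an alternative way of sharding LoRA adapters. The block-diagonal LoRA approach is shown to be about as parameter-efficient as standard LoRA. For example, when serving on eight A100 GPUs, up to 1.79x (1.23x) end-to-end speed-up is observed with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.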
Reference score (independently computed attention score): 10.097889959657277
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged with the base model's weights as the adapter swapping would create overhead and requests using different adapters could not be batched. Rather, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that is well aligned with the base model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy encounters some communication overhead, which may be small in theory, but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that does not require any additional communication for the LoRA computations. We demonstrate in extensive experiments that our block-diagonal LoRA approach is similarly parameter efficient as standard LoRA (i.e., for a similar number of parameters it achieves similar downstream performance) and that it leads to significant end-to-end speed-up over S-LoRA. For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
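Abstract (reference translation): When several LoRA adapters are used simultaneously, the adapters cannot simply be merged into the base model's weights, because adapter swapping would create overhead and requests using different adapters could not be batched. Instead, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that aligns well with the base model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy incurs communication overhead that may be small in theory but can be large in practice. This paper proposes constraining certain LoRA factors to be block-diagonal, which enables a way of sharding LoRA adapters that requires no additional communication for the LoRA computations. Extensive experiments show that the block-diagonal LoRA approach is about as parameter-efficient as standard LoRA and yields a significant end-to-end speed-up over S-LoRA. For example, when serving on eight A100 GPUs, up to 1.79x (1.23x) end-to-end speed-up is observed with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.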
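
The abstract does not spell out the sharding layout, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes a column-parallel layer whose input `x` is replicated on every device, a LoRA update of the form `x @ A @ B`, and a block-diagonal constraint on `B` with one block per device. All names (`lora_delta_sharded`, `num_devices`, and the helper functions) are hypothetical. Under these assumptions each device can produce its own slice of the LoRA output from its local slice of `A` and its local block of `B`, with no data exchanged between devices for the LoRA term.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def block_diag(blocks):
    """Assemble a dense block-diagonal matrix from equally shaped blocks."""
    total_cols = sum(len(block[0]) for block in blocks)
    dense, offset = [], 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            pad_right = total_cols - offset - width
            dense.append([0.0] * offset + list(row) + [0.0] * pad_right)
        offset += width
    return dense


def lora_delta_sharded(x, a, b_blocks):
    """Return one LoRA output shard per simulated device.

    Assumption: device i holds only the i-th column slice of A and the
    i-th diagonal block of B, and sees the full (replicated) input x.
    """
    rank_per_device = len(b_blocks[0])
    shards = []
    for i, b_i in enumerate(b_blocks):
        start, stop = i * rank_per_device, (i + 1) * rank_per_device
        a_i = [row[start:stop] for row in a]      # local slice of A
        shards.append(matmul(matmul(x, a_i), b_i))  # purely local compute
    return shards


def concat_cols(shards):
    """Concatenate per-device shards along the feature dimension."""
    rows = []
    for r in range(len(shards[0])):
        row = []
        for shard in shards:
            row.extend(shard[r])
        rows.append(row)
    return rows


# Hypothetical toy sizes chosen only for illustration.
rng = random.Random(0)
num_devices, batch, d_in, d_out, rank = 4, 2, 8, 12, 8

x = rand_matrix(batch, d_in, rng)
a = rand_matrix(d_in, rank, rng)
b_blocks = [
    rand_matrix(rank // num_devices, d_out // num_devices, rng)
    for _ in range(num_devices)
]

# Dense single-device reference versus the simulated sharded computation.
reference = matmul(matmul(x, a), block_diag(b_blocks))
sharded = concat_cols(lora_delta_sharded(x, a, b_blocks))

max_err = max(
    abs(u - v)
    for ref_row, shard_row in zip(reference, sharded)
    for u, v in zip(ref_row, shard_row)
)
```

In this toy setting the concatenated per-device shards match the dense reference up to floating-point rounding, which is the property that lets the LoRA term avoid its own communication step. How the paper handles other layer types, and which factor it constrains in each case, is not stated in the abstract.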


Related papers