Fugu-MT Paper Translation (Abstract): LoRAFusion: Efficient LoRA Fine-Tuning for LLMs

Paper summary: LoRAFusion: Efficient LoRA Fine-Tuning for LLMs

arxiv url: http://arxiv.org/abs/2510.00206v1
Date: Tue, 30 Sep 2025 19:26:22 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.226535
Title: LoRAFusion: Efficient LoRA Fine-Tuning for LLMs
Title (reference translation): LoRAFusion: Efficient LoRA Fine-Tuning for LLMs
Authors: Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, Gennady Pekhimenko
Abstract summary: Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs). We introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. LoRAFusion achieves up to $1.96\times$ ($1.47\times$ on average) end-to-end speedup compared to Megatron-LM, and up to $1.46\times$ ($1.29\times$ on average) improvement over mLoRA.
Reference score (independently computed attention score): 7.13923757932177
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs), as it significantly reduces GPU memory usage while maintaining competitive fine-tuned model quality on downstream tasks. Despite these benefits, we identify two key inefficiencies in existing LoRA fine-tuning systems. First, they incur substantial runtime overhead due to redundant memory accesses on large activation tensors. Second, they miss the opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs. This leads to missed performance gains such as reduced pipeline bubbles, better communication overlap, and improved GPU load balance. To address these issues, we introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. At the kernel level, we propose a graph-splitting method that fuses memory-bound operations. This design eliminates unnecessary memory accesses and preserves the performance of compute-bound GEMMs without incurring the cost of recomputation or synchronization. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. It first splits LoRA adapters into groups to intentionally stagger batch execution across jobs, and then solves a bin-packing problem within each group to generate balanced, dependency-aware microbatches. LoRAFusion achieves up to $1.96\times$ ($1.47\times$ on average) end-to-end speedup compared to Megatron-LM, and up to $1.46\times$ ($1.29\times$ on average) improvement over mLoRA, the state-of-the-art multi-LoRA fine-tuning system. Our fused kernel achieves up to $1.39\times$ ($1.27\times$ on average) kernel performance improvement and can directly serve as a plug-and-play replacement in existing LoRA systems. We open-source LoRAFusion at https://github.com/CentML/lorafusion.
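Abstract (reference translation): Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs), because it substantially reduces GPU memory usage while keeping fine-tuned model quality competitive on downstream tasks. Despite these benefits, we identify two key inefficiencies in existing LoRA fine-tuning systems. First, redundant memory accesses on large activation tensors cause substantial runtime overhead. Second, these systems miss the opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs, and so forgo performance gains such as reduced pipeline bubbles, better communication overlap, and improved GPU load balance. To address these issues, we introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. At the kernel level, we propose a graph-splitting method that fuses memory-bound operations. This design eliminates unnecessary memory accesses and preserves the performance of compute-bound GEMMs without incurring the cost of recomputation or synchronization. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. It first splits LoRA adapters into groups to intentionally stagger batch execution across jobs, and then solves a bin-packing problem within each group to generate balanced, dependency-aware microbatches. LoRAFusion achieves up to $1.96\times$ ($1.47\times$ on average) end-to-end speedup compared to Megatron-LM, and up to $1.46\times$ ($1.29\times$ on average) improvement over mLoRA, the state-of-the-art multi-LoRA fine-tuning system. Our fused kernel achieves up to $1.39\times$ ($1.27\times$ on average) kernel performance improvement and can directly serve as a plug-and-play replacement in existing LoRA systems. LoRAFusion is open-sourced at https://github.com/CentML/lorafusion.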
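
The abstract describes the scheduling step only at a high level: within each adapter group, a bin-packing problem is solved to produce balanced microbatches. The sketch below illustrates the general idea with the standard first-fit-decreasing heuristic. It is not LoRAFusion's algorithm: the function name `pack_microbatches`, the `(adapter, tokens)` sample representation, and the token `capacity` limit are all assumptions made for illustration, and the sketch ignores the dependency-awareness and cross-job staggering that the paper mentions.

```python
from typing import List, Tuple

# Hypothetical sample representation: (adapter name, token count).
Sample = Tuple[str, int]


def pack_microbatches(samples: List[Sample], capacity: int) -> List[List[Sample]]:
    """Pack samples into microbatches with first-fit decreasing.

    Each microbatch holds at most `capacity` tokens. This is a generic
    bin-packing heuristic, not the solver used by LoRAFusion.
    """
    batches: List[List[Sample]] = []
    loads: List[int] = []  # tokens already placed in each microbatch

    # Placing the largest samples first tends to give tighter packings.
    for sample in sorted(samples, key=lambda s: s[1], reverse=True):
        tokens = sample[1]
        if tokens > capacity:
            raise ValueError(f"sample with {tokens} tokens exceeds capacity {capacity}")
        for i, load in enumerate(loads):
            if load + tokens <= capacity:
                batches[i].append(sample)
                loads[i] += tokens
                break
        else:
            # No existing microbatch has room, so open a new one.
            batches.append([sample])
            loads.append(tokens)
    return batches


if __name__ == "__main__":
    # Samples from three hypothetical adapters sharing one base model.
    jobs = [("a", 700), ("b", 500), ("a", 300), ("c", 400), ("b", 100)]
    for batch in pack_microbatches(jobs, capacity=1000):
        print(sum(t for _, t in batch), batch)
```

Mixing samples from several adapters in one microbatch is what lets short and long sequences from different jobs fill each other's gaps. A single job's samples alone would leave more of the token capacity unused.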


Related papers