Fugu-MT paper translation (abstract): FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic

Paper overview: FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic

arxiv url: http://arxiv.org/abs/2510.24061v1
Date: Tue, 28 Oct 2025 04:44:49 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:36.789189
Title: FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
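Title (reference translation): FALQON: Accelerating LoRA Fine-Tuning with Low-Bit Floating-Point Arithmetic
Authors: Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee
Abstract summary: Low-bit floating-point (FP) formats such as FP8 provide significant acceleration and memory savings in model training. This paper proposes FALQON, a novel framework that eliminates quantization overhead from the low-rank adaptation (LoRA) computational path. FALQON achieves approximately a 3$\times$ training speedup over existing quantized LoRA methods at a similar level of accuracy.
Reference score (independently computed attention measure): 9.192731482247103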
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training thanks to native hardware support on modern GPUs and NPUs. However, our analysis shows that FP8 quantization offers speedup primarily for large-dimensional matrix multiplications, while inherent quantization overheads diminish the speedup when it is applied to low-rank adaptation (LoRA), which uses small-dimensional matrices for efficient fine-tuning of large language models (LLMs). To address this limitation, we propose FALQON, a novel framework that eliminates the quantization overhead from separate LoRA computational paths by directly merging LoRA adapters into an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the forward and backward computations for merged adapters to significantly reduce quantization overhead, and introduce a row-wise proxy update mechanism that efficiently integrates substantial updates into the quantized backbone. Experimental evaluations demonstrate that FALQON achieves approximately a 3$\times$ training speedup over existing quantized LoRA methods with a similar level of accuracy, providing a practical solution for efficient large-scale model fine-tuning. Moreover, FALQON's end-to-end FP8 workflow removes the need for post-training quantization, facilitating efficient deployment. Code is available at https://github.com/iamkanghyunchoi/falqon.
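Abstract (reference translation): Low-bit floating-point (FP) formats such as FP8 deliver large speedups and memory savings in model training, owing to native hardware support on the latest GPUs and NPUs. However, FP8 quantization speeds up mainly large-dimensional matrix multiplications; when it is applied to low-rank adaptation (LoRA), which uses small-dimensional matrices to fine-tune large language models (LLMs) efficiently, the inherent quantization overhead reduces the speedup. To address this limitation, FALQON is proposed as a new framework that removes the quantization overhead of the separate LoRA computational path by merging the LoRA adapters directly into the FP8-quantized backbone during fine-tuning. In addition, the forward and backward computations for the merged adapters are reformulated to cut quantization overhead substantially, and a row-wise proxy update mechanism is introduced that efficiently integrates substantial updates into the quantized backbone. Experiments show that FALQON achieves approximately a 3$\times$ training speedup over existing quantized LoRA methods at comparable accuracy, offering a practical solution for efficient large-scale model fine-tuning. Furthermore, FALQON's end-to-end FP8 workflow eliminates the need for post-training quantization and eases efficient deployment. Code is available at https://github.com/iamkanghyunchoi/falqon.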
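
The abstract's observation that FP8 helps large matrix multiplications more than LoRA's small ones can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper or its code: the sizes (`tokens`, `d`, `r`) and the helper `arithmetic_intensity` are hypothetical, and counting FLOPs per element read or written is a simplifying assumption that ignores kernel launch cost and the computation of scaling factors.

```python
def arithmetic_intensity(n: int, k: int, m: int) -> float:
    """FLOPs per element touched for an (n x k) @ (k x m) matrix product.

    FLOPs: one multiply and one add per inner-product term.
    Elements touched: both inputs plus the output.
    """
    flops = 2 * n * k * m
    elements = n * k + k * m + n * m
    return flops / elements


# Hypothetical sizes: a batch of tokens, a hidden dimension, and a LoRA rank.
tokens, d, r = 4096, 4096, 16

backbone = arithmetic_intensity(tokens, d, d)   # x @ W, with W of shape (d, d)
lora_down = arithmetic_intensity(tokens, d, r)  # x @ A, with A of shape (d, r)
lora_up = arithmetic_intensity(tokens, r, d)    # (x @ A) @ B, with B of shape (r, d)

print(f"backbone  : {backbone:8.2f} FLOPs per element")
print(f"LoRA down : {lora_down:8.2f} FLOPs per element")
print(f"LoRA up   : {lora_up:8.2f} FLOPs per element")
print(f"ratio     : {backbone / lora_down:8.2f}x")
```

With these assumed sizes the backbone product performs on the order of a few thousand FLOPs per element touched, against a few tens for each LoRA product, so any per-element cost such as casting operands to FP8 is amortized far less in the low-rank path.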

Related papers