Fugu-MT Paper Translation (Abstract): Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

Paper overview: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

arxiv url: http://arxiv.org/abs/2601.16991v2
Date: Wed, 28 Jan 2026 10:53:31 GMT
Status: Translation complete
Last updated in system: 2026-02-02 02:21:38.464338
Title: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models
Title (reference translation): Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models
Authors: Longteng Zhang, Sen Wu, Shuai Hou, Zhengyu Qing, Zhuo Zheng, Danning Ke, Qihong Lin, Qiang Wang, Shaohuai Shi, Xiaowen Chu
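Abstract summary: Low-rank adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. SALR (Sparsity-Aware Low-Rank Representation) is a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning.
Reference score (independently computed attention score): 19.288371639304504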
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA's performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of $(1 - r/\min(d,k))$. To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50\% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by $2\times$, and delivers up to a $1.7\times$ inference speedup.
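The pruning-plus-recovery idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `magnitude_prune` and `truncated_svd_adapter`, the matrix sizes, and the random weights are all illustrative assumptions. It prunes a stand-in for a frozen weight matrix by magnitude, then approximates the discarded residual with a rank-r truncated SVD and compares the per-entry MSE before and after recovery.

```python
import numpy as np


def magnitude_prune(w, sparsity):
    """Return a copy of w with the smallest-magnitude entries set to zero.

    `sparsity` is the fraction of entries to zero out (hypothetical helper).
    """
    n_prune = int(round(sparsity * w.size))
    if n_prune == 0:
        return w.copy()
    # The n_prune-th smallest absolute value is the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), n_prune - 1)[n_prune - 1]
    mask = np.abs(w) > thresh
    return w * mask


def truncated_svd_adapter(residual, r):
    """Best rank-r factors (b, a) of `residual`, so residual ~= b @ a."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    b = u[:, :r] * s[:r]  # shape (d, r), singular values folded into b
    a = vt[:r, :]         # shape (r, k)
    return b, a


rng = np.random.default_rng(0)
d, k, r = 64, 48, 8                  # assumed sizes, for illustration only
w = rng.standard_normal((d, k))      # stand-in for a frozen base weight

w_sparse = magnitude_prune(w, 0.5)   # 50% sparsity, as in the abstract
residual = w - w_sparse              # information discarded by pruning
b, a = truncated_svd_adapter(residual, r)

mse_pruned = np.mean(residual ** 2)
mse_recovered = np.mean((residual - b @ a) ** 2)
bound = (1 - r / min(d, k)) * mse_pruned

print(f"MSE after pruning only:       {mse_pruned:.6f}")
print(f"MSE with rank-{r} adapter:      {mse_recovered:.6f}")
print(f"(1 - r/min(d,k)) * pruned MSE: {bound:.6f}")
```

In this sketch the factor $(1 - r/\min(d,k))$ is read as an upper bound on the remaining error: the top r squared singular values always carry at least a fraction $r/\min(d,k)$ of the residual's total energy, so the truncated-SVD adapter leaves at most the stated share of the pruning MSE.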
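Abstract (reference translation): Adapting large pre-trained language models to downstream tasks often requires fine-tuning millions of parameters or performing costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models, but it typically degrades LoRA's performance when applied naively. This paper introduces SALR, a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. Statically pruning only the frozen base weights minimizes the pruning error bound, and the discarded residual information is recovered via a truncated-SVD low-rank adapter. To maximize hardware efficiency, multiple low-rank adapters are fused into a single concatenated GEMM, and a bitmap-based encoding with a two-stage pipelined decoding + GEMM design is adopted to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by $2\times$, and delivers up to a $1.7\times$ inference speedup.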
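The abstract does not spell out how the adapters are fused into a single concatenated GEMM. One plausible reading, sketched below under that assumption, is that several low-rank pairs applied to the same input can be stacked so that their summed contribution needs one pair of matrix products instead of one pair per adapter. The variable names, shapes, and adapter count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n_adapters, batch = 32, 24, 4, 3, 5  # assumed sizes

# Hypothetical adapters: each pair (b_i, a_i) contributes b_i @ a_i @ x.
bs = [rng.standard_normal((d, r)) for _ in range(n_adapters)]
as_ = [rng.standard_normal((r, k)) for _ in range(n_adapters)]
x = rng.standard_normal((k, batch))

# Unfused: two small matrix products per adapter, then a sum.
separate = sum(b_i @ (a_i @ x) for b_i, a_i in zip(bs, as_))

# Fused: stack the b factors side by side and the a factors on top of
# each other, so that b_cat @ a_cat equals the sum of all b_i @ a_i.
b_cat = np.concatenate(bs, axis=1)   # shape (d, n_adapters * r)
a_cat = np.concatenate(as_, axis=0)  # shape (n_adapters * r, k)
fused = b_cat @ (a_cat @ x)

print(np.allclose(separate, fused))
```

The two paths are algebraically identical. Any benefit of the fused form would come from launching fewer, larger matrix multiplications, which is a hardware-level effect that this CPU sketch does not measure.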

