Fugu-MT Paper Translation (Abstract): AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

Paper overview: AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

arxiv url: http://arxiv.org/abs/2512.15764v1
Date: Fri, 12 Dec 2025 09:44:07 GMT
Status: Translation complete
Last updated in system: 2025-12-19 18:10:31.677416
Title: AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs
Title (reference translation): AdaGradSelect: An adaptive gradient-guided layer selection method for highly efficient fine-tuning of SLMs
Authors: Anshul Kumar, Gagan Raj Gupta, Manisha Chawla
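Abstract summary: Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. AdaGradSelect is an adaptive method that selects which transformer blocks to update based on gradients. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory.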
Reference score (independently computed attention score): 0.6652641137999891
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.
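Abstract (reference translation): Large Language Models (LLMs) can perform many NLP tasks well, but full fine-tuning is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods confine training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It combines Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, with an epsilon-greedy exploration strategy. This lets the method explore different blocks early in training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.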
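
The selection scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the epsilon-greedy step and the Dirichlet sampling are combined, so the sketch assumes one plausible reading. With probability epsilon it picks the k blocks with the highest gradient norms; otherwise it samples k blocks from a Dirichlet draw whose concentration grows with past update counts. The function names, the prior of 1 added to each count, and the parameters k and epsilon are all hypothetical.

```python
import random


def top_k_by_grad_norm(grad_norms, k):
    """Return the (sorted) indices of the k blocks with the largest gradient norms."""
    order = sorted(range(len(grad_norms)), key=lambda i: grad_norms[i], reverse=True)
    return sorted(order[:k])


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def select_blocks(grad_norms, update_counts, k, epsilon, rng):
    """Choose k block indices to train this step and record them in update_counts.

    Assumed rule (not confirmed by the abstract):
      - with probability epsilon, take the top-k blocks by gradient norm;
      - otherwise, sample k distinct blocks using Dirichlet-distributed weights
        whose concentration is 1 + the number of past updates of each block.
    """
    n = len(grad_norms)
    if rng.random() < epsilon:
        chosen = top_k_by_grad_norm(grad_norms, k)
    else:
        # The added 1.0 is an assumed prior that keeps every concentration positive.
        probs = sample_dirichlet([1.0 + c for c in update_counts], rng)
        remaining = list(range(n))
        chosen = []
        for _ in range(k):
            # Weighted sampling without replacement, one block at a time.
            weights = [probs[i] for i in remaining]
            pick = rng.choices(remaining, weights=weights, k=1)[0]
            chosen.append(pick)
            remaining.remove(pick)
        chosen.sort()
    for i in chosen:
        update_counts[i] += 1
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [0, 0, 0, 0]
    norms = [0.1, 2.0, 0.5, 1.5]  # made-up per-block gradient norms
    print(select_blocks(norms, counts, k=2, epsilon=1.0, rng=rng))
    print(counts)
```

In a training loop, only the parameters of the chosen blocks would be updated at each step. Decaying epsilon over time is one way to shift from gradient-driven selection toward the frequency-based sampling, which would match the abstract's description of exploring early and focusing later; the schedule itself is not specified there.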

Related papers