Fugu-MT Paper Translation (Abstract): Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

Paper abstract: Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

arxiv url: http://arxiv.org/abs/2605.29303v1
Date: Thu, 28 May 2026 03:36:05 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.638509
Title: Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
Title (reference translation): Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
Authors: Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen
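Abstract summary: Supervised fine-tuning followed by reinforcement learning has become a standard post-training paradigm for large language models. EKSFT (Entropy-KL Selective Fine-Tuning) selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. Empirical evaluations on mathematical reasoning benchmarks show that EKSFT consistently outperforms standard SFT.
Reference score (independently computed attention level): 52.11240605311707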
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE-USTC/EKSFT.
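Abstract (reference translation): Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold start for RL exploration and avoids the inefficiency of pure RL, where on-policy sampling yields too few positive samples. In practice, however, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, so the model can fit the limited samples and drift away from its pre-trained distribution. This distribution shift impedes the model's ability to explore effectively during subsequent RL training. To address this challenge, the authors propose that in low-data regimes SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, they propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks show that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration in the RL stage. The code and datasets are available at https://github.com/MINE-USTC/EKSFT.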
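The token-masking idea in the abstract can be illustrated with a small, dependency-free Python sketch. This is not the authors' implementation (see the linked repository for that). The abstract does not say how the thresholds are chosen, which model's entropy is used, or the direction of the KL divergence, so the sketch assumes fixed thresholds, entropy of the model being fine-tuned, and KL(policy || reference). All function names and numeric values below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; the direction is an assumption of this sketch."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def selective_mask(policy_logits, ref_logits, entropy_threshold, kl_threshold):
    """Return 1 for tokens kept in the SFT loss and 0 for masked tokens.

    A token is masked when its entropy OR its KL divergence from the
    reference model exceeds the (assumed, fixed) threshold.
    """
    keep = []
    for pl, rl in zip(policy_logits, ref_logits):
        p, q = softmax(pl), softmax(rl)
        masked = entropy(p) > entropy_threshold or kl_divergence(p, q) > kl_threshold
        keep.append(0 if masked else 1)
    return keep

def masked_sft_loss(policy_logits, targets, keep):
    """Mean cross-entropy over kept tokens only (0.0 if every token is masked)."""
    total, count = 0.0, 0
    for logits, target, k in zip(policy_logits, targets, keep):
        if k:
            total += -math.log(softmax(logits)[target])
            count += 1
    return total / count if count else 0.0

# Toy sequence of three token positions over a three-word vocabulary.
policy_logits = [[5.0, 0.0, 0.0],   # confident, agrees with the reference
                 [0.0, 0.0, 0.0],   # uniform distribution: high entropy
                 [4.0, 0.0, 0.0]]   # confident, but disagrees with the reference
ref_logits = [[5.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]]
targets = [0, 1, 0]

keep = selective_mask(policy_logits, ref_logits, entropy_threshold=0.8, kl_threshold=0.5)
loss = masked_sft_loss(policy_logits, targets, keep)
print(keep, loss)  # positions 1 and 2 are masked under these thresholds
```

In a real training loop the same mask would typically be applied to the per-token loss of a deep learning framework rather than computed in pure Python. The sketch only shows the selection logic and how masked tokens drop out of the imitation objective.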

Related papers