Fugu-MT Paper Translation (Abstract): Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Paper overview: Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

arxiv url: http://arxiv.org/abs/2603.19470v1
Date: Thu, 19 Mar 2026 21:04:17 GMT
Status: Translation complete
Updated in system: 2026-03-23 19:48:38.887337
Title: Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Title (reference translation): Adaptive Layerwise Perturbation: Off-Policy Corrections for LLM RL
Authors: Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang
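Abstract summary: Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability. The paper proposes Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates. ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noise.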
Reference score (independently computed attention score): 26.49103739671071
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. Efforts to enhance inference efficiency widen the distribution gap between the inference policy and the updated policy, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the resulting perturbed policy is used as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noise. Hence, the flattened distribution naturally tightens the gap between the updated and inference policies and reduces the tail of the importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-up of the importance ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
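Abstract (reference translation): Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and for further exploration in LLM RL. In the pursuit of inference efficiency, the distribution gap between the inference policy and the updated policy grows, producing heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates steep gradients and can push updates outside the trust region. This paper therefore proposes ALP (Adaptive Layerwise Perturbation), which injects small learnable perturbations into the input hidden states of each layer during updates. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noise. As a result, the flattened distribution naturally tightens the gap between the updated and inference policies, reduces the tail of the importance ratios, and maintains training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-up of the importance ratio tail and KL spikes during iterative training, while boosting exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.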
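
The mechanism described in the abstract (perturb the input hidden state of every layer during the update, then use the perturbed policy as the numerator of the importance ratio against the unchanged inference policy) can be illustrated with a toy sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the tiny tanh network, the helper names (`make_toy_policy`, `forward_logprobs`, `importance_ratio`), and in particular the choice of Gaussian noise with a per-layer scale `sigmas[l]` standing in for the "small learnable perturbation", whose exact form the abstract does not specify.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def make_toy_policy(dim=4, n_layers=3, n_actions=5, seed=0):
    """Hypothetical toy policy: n_layers square tanh layers plus an output head."""
    rng = random.Random(seed)
    layers = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(n_layers)]
    head = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_actions)]
    return layers, head


def forward_logprobs(layers, head, x, sigmas=None, rng=None):
    """Return action log-probabilities for input x.

    If sigmas is given, Gaussian noise with scale sigmas[l] is added to the
    input hidden state of layer l (ASSUMPTION: a stand-in for the learnable
    perturbation; the paper's parameterization may differ).
    With sigmas=None this is the unperturbed policy.
    """
    h = list(x)
    for l, W in enumerate(layers):
        if sigmas is not None:
            h = [hi + sigmas[l] * rng.gauss(0.0, 1.0) for hi in h]
        h = [math.tanh(v) for v in matvec(W, h)]
    return log_softmax(matvec(head, h))


def importance_ratio(train_params, infer_logprobs, x, action, sigmas, rng):
    """Perturbed training policy (numerator) over fixed inference policy."""
    layers, head = train_params
    logp = forward_logprobs(layers, head, x, sigmas, rng)
    return math.exp(logp[action] - infer_logprobs[action])


if __name__ == "__main__":
    params = make_toy_policy()
    x = [0.5, -0.2, 0.1, 0.9]
    infer_lp = forward_logprobs(*params, x)  # unchanged inference policy
    rng = random.Random(1)
    for scale in (0.0, 0.05, 0.2):
        r = importance_ratio(params, infer_lp, x, 2, [scale] * 3, rng)
        print(f"sigma={scale}: ratio={r:.4f}")
```

In this sketch the training and inference policies share the same weights, so a zero noise scale gives a ratio of exactly 1; in an actual RL loop the two would differ because of policy staleness and training-inference mismatch, and the noise scales would be trained alongside the policy rather than fixed by hand.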

Related papers