Fugu-MT Paper Translation (Abstract): Clipping-Free Policy Optimization for Large Language Models

Paper summary: Clipping-Free Policy Optimization for Large Language Models

arxiv url: http://arxiv.org/abs/2601.22801v1
Date: Fri, 30 Jan 2026 10:32:37 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.383018
Title: Clipping-Free Policy Optimization for Large Language Models
Authors: Ömer Veysel Çağatan, Barış Akgün, Gözde Gül Şahin, Xuandong Zhao
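Abstract summary: Reinforcement learning has become central to post-training large language models. The dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale. This paper proposes Clipping-Free Policy Optimization, which replaces clipping with a convex penalty derived from Total Variation divergence constraints.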
Reference score (independently computed attention score): 30.663054788473598
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
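To make the contrast in the abstract concrete, here is a minimal per-sample sketch in plain Python that compares a PPO-style clipped surrogate with a quadratic-penalty surrogate. The function names and the penalty weight `abs(adv) / (2 * eps)` are assumptions chosen for illustration; the abstract does not state the exact CFPO objective, so this should not be read as the authors' implementation. It only shows why clipping produces zero-gradient regions while a convex quadratic penalty stays differentiable everywhere.

```python
def clipped_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped_ratio * adv)


def quadratic_penalty_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Illustrative clipping-free surrogate: linear term minus a convex quadratic penalty.

    ASSUMPTION: the |adv| / (2 * eps) weight is picked for this sketch so that the
    objective peaks at ratio = 1 + sign(adv) * eps. It is not taken from the paper.
    """
    return ratio * adv - abs(adv) / (2.0 * eps) * (ratio - 1.0) ** 2


def numerical_grad(fn, ratio: float, adv: float, eps: float = 0.2, h: float = 1e-6) -> float:
    """Central finite-difference derivative of fn with respect to the ratio."""
    return (fn(ratio + h, adv, eps) - fn(ratio - h, adv, eps)) / (2.0 * h)


if __name__ == "__main__":
    # Probability ratio far outside the trust region, with a positive advantage.
    ratio, adv = 1.5, 1.0
    print("clipped grad:  ", numerical_grad(clipped_surrogate, ratio, adv))
    print("quadratic grad:", numerical_grad(quadratic_penalty_surrogate, ratio, adv))
```

With a positive advantage and a ratio beyond `1 + eps`, the clipped surrogate is flat, so the sample contributes no gradient. The quadratic variant instead has a negative slope there, which pulls the ratio back toward the trust region rather than ignoring the sample.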

Related papers