Fugu-MT Paper Translation (Abstract): From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Paper overview: From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

arxiv url: http://arxiv.org/abs/2512.02580v1
Date: Tue, 02 Dec 2025 09:48:57 GMT
Status: Translation complete
Last updated in system: 2025-12-03 21:04:45.811275
Title: From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
Title (reference translation): From Imitation to Discrimination: Toward a Generalized Advantage Mechanism for Cross-Domain Reasoning Tasks
Authors: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
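Abstract summary: This paper proposes **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish a robust foundation. The method consistently achieves stable and significant improvements on mathematical reasoning tasks.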
Reference score (independently computed attention score): 21.71569299337131
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
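Abstract (reference translation): Reinforcement learning has emerged as a paradigm for post-training large language models, enhancing their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting performance that is better or worse than expected, and so yield both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, can lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish a robust foundation, then introduces negative signals to cultivate discriminative capability, improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, the method consistently achieves stable and significant improvements on mathematical reasoning tasks and generalizes effectively to multimodal graphical user interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.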
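
As a rough illustration of the curriculum described in the abstract, the following Python sketch zeroes out negative advantages during an early imitation phase and passes the full signal through afterward. The abstract does not say how CAPO decides when to introduce negative signals, so the hard `switch_step` threshold, the function names, and the group-wise normalization with a population standard deviation are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Standardize rewards within one group of samples (a GRPO-style baseline).

    Assumption: population standard deviation; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def curriculum_advantages(advantages, step, switch_step):
    """Apply a two-phase curriculum to the advantage signal.

    Phase 1 (step < switch_step): keep only positive advantages (imitation).
    Phase 2 (step >= switch_step): keep positive and negative advantages
    (discrimination).
    `switch_step` is a hypothetical hyperparameter, not taken from the paper.
    """
    if step < switch_step:
        return [a if a > 0 else 0.0 for a in advantages]
    return list(advantages)


rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. correct / incorrect answers in one group
adv = group_advantages(rewards)
early = curriculum_advantages(adv, step=10, switch_step=100)
late = curriculum_advantages(adv, step=200, switch_step=100)

print(adv)    # [1.0, -1.0, -1.0, 1.0]
print(early)  # [1.0, 0.0, 0.0, 1.0]
print(late)   # [1.0, -1.0, -1.0, 1.0]
```

Because the curriculum only rewrites the advantage values, a filter of this kind can sit in front of any policy-gradient loss that consumes per-sample advantages, which is consistent with the abstract's claim of compatibility with GRPO, PPO, RLOO, and Reinforce++.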


Related papers