Fugu-MT paper translation (abstract): Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

Paper overview: Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

arxiv url: http://arxiv.org/abs/2605.23384v1
Date: Fri, 22 May 2026 08:54:37 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.270757
Title: Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Authors: Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
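Abstract summary: This paper introduces Metacognition-as-Reward (MaR), a metacognition-inspired RL framework. MaR guides LLM reasoning through two general process dimensions. MaR consistently improves model performance, achieving up to a 7.7% gain over the base model.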
Reference score (independently computed attention score): 75.25256166997414
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.
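The abstract names three ingredients of the trajectory-level reward (task knowledge coverage, regulation fidelity, and final-answer correctness) but does not say how they are scored or combined. The following is only a minimal sketch under stated assumptions: the weighted-sum form, the `[0, 1]` score ranges, the weight values, and the names `RewardWeights` and `trajectory_reward` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not state how components are combined.
    knowledge: float = 0.25
    regulation: float = 0.25
    answer: float = 0.5


def trajectory_reward(knowledge_coverage: float,
                      regulation_fidelity: float,
                      answer_correct: bool,
                      weights: RewardWeights = RewardWeights()) -> float:
    """Combine three per-trajectory scores into one scalar reward.

    Assumption (not from the paper): coverage and fidelity are scored
    in [0, 1] and the combination is a plain weighted sum.
    """
    for name, value in (("knowledge_coverage", knowledge_coverage),
                        ("regulation_fidelity", regulation_fidelity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (weights.knowledge * knowledge_coverage
            + weights.regulation * regulation_fidelity
            + weights.answer * float(answer_correct))
```

Under these assumed weights, a rollout with a correct final answer but only half of the task knowledge covered and half-faithful regulation would score `trajectory_reward(0.5, 0.5, True)`, which is 0.75, so process quality still separates two rollouts that reach the same answer.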
