Fugu-MT Paper Translation (Abstract): Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Paper overview: Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.03259v1
Date: Fri, 26 Sep 2025 14:05:48 GMT
Status: Translation complete
Last updated in system: 2025-10-12 15:03:05.81256
Title: Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning
Authors: Yoonjeon Kim, Doohyuk Jang, Eunho Yang
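Abstract summary: We design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA). Unlike existing meta-cognitive reasoning models, our method does not require external training sources. Our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks.
Reference score (independently computed attention score): 38.67622953293653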
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies on reasoning models explore the meta-awareness of language models, that is, a model's ability to know how to think by itself. We argue that large reasoning models lack this meta-awareness property by proving severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and prove that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by (i) filtering out zero-variance prompts that are either trivial or unsolvable and (ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method can speed up GRPO training by over 1.28x to reach the same performance, and it achieves a 19.3% accuracy gain on AIME25 and a 6.2% average gain across six mathematics benchmarks. Training with meta-cognitive guidance also enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
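The zero-variance filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how MASA identifies such prompts, so the code below is an assumption-laden illustration rather than the authors' procedure: it simply inspects observed rollout rewards to show why a prompt whose rollouts all succeed (trivial) or all fail (unsolvable) contributes no learning signal under a GRPO-style group-relative advantage. The function names, the binary rewards, the use of the population standard deviation, and the `eps` value are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every rollout in the group received the same reward."""
    return len(set(rewards)) <= 1


def filter_prompts(reward_groups):
    """Keep only prompts whose rollout rewards are not all identical."""
    return {p: r for p, r in reward_groups.items() if not is_zero_variance(r)}


# Hypothetical binary rewards for four rollouts per prompt.
groups = {
    "trivial": [1, 1, 1, 1],       # always solved
    "unsolvable": [0, 0, 0, 0],    # never solved
    "informative": [1, 0, 0, 1],   # mixed outcomes
}

kept = filter_prompts(groups)
print(sorted(kept))
print(group_advantages(groups["trivial"]))
```

In this sketch every advantage for the "trivial" and "unsolvable" groups is zero, so spending rollouts on them yields no gradient; only the mixed-outcome prompt survives the filter.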
