Fugu-MT Paper Translation (Abstract): ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Paper abstract: ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

arxiv url: http://arxiv.org/abs/2604.27644v1
Date: Thu, 30 Apr 2026 09:35:57 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.024582
Title: ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning
Title (reference translation): ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning
Authors: Chengcao Yang, Jun Chen
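Abstract summary: Can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? This paper introduces ANCORA, an anchored-curriculum framework in which a unified policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions.
Reference score (independently computed attention score): 6.362676503567886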
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in which a unified policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions. ANCORA rests on three load-bearing mechanisms: a two-level group-relative update that couples Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT that projects the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications. These stabilizers are necessary because sparse verifier feedback otherwise drives Proposer collapse even under MLRL-aligned rewards. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in the test-time-training setting under 0-shot evaluation, outperforming the PSV self-play baseline by 15.8 points despite PSV using 1-shot inference; in a separate transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
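Abstract (reference translation): Can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? This paper introduces ANCORA, an anchored-curriculum framework in which a unified policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions. ANCORA rests on three load-bearing mechanisms: a two-level group-relative update that couples Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT that projects the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications. These stabilizers are necessary because sparse verifier feedback otherwise drives Proposer collapse even under MLRL-aligned rewards. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5%, outperforming the PSV self-play baseline by 15.8 points despite PSV using 1-shot inference.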
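
Illustrative sketch (not from the paper): the abstract names a two-level group-relative update and UCB-guided curriculum selection but gives no formulas. The Python below is one plausible reading under stated assumptions: rewards are normalized within a group in the GRPO style (mean-centered, divided by the standard deviation), the Proposer reward is a hypothetical function of each specification's solve rate, and node selection uses the standard UCB1 rule. The function names, the `proposer_reward` parameter, and the `(visits, mean_value)` node layout are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def two_level_advantages(solver_rewards_per_spec, proposer_reward):
    """Hypothetical two-level update; the paper's exact form is not in the abstract.

    solver_rewards_per_spec: one list of verifier rewards per proposed
        specification (1.0 = verified, 0.0 = rejected).
    proposer_reward: assumed function mapping a spec's solve rate to a
        Proposer reward.
    """
    # Level 1: Solver advantages, normalized across attempts on the same spec.
    solver_adv = [group_relative_advantages(r) for r in solver_rewards_per_spec]
    # Level 2: Proposer advantages, normalized across specs in the batch.
    solve_rates = [mean(r) for r in solver_rewards_per_spec]
    proposer_adv = group_relative_advantages(
        [proposer_reward(p) for p in solve_rates]
    )
    return proposer_adv, solver_adv


def ucb1_select(nodes, c=math.sqrt(2)):
    """Pick a curriculum node by UCB1; nodes maps name -> (visits, mean_value)."""
    total = sum(visits for visits, _ in nodes.values())

    def score(item):
        visits, value = item[1]
        if visits == 0:
            return math.inf  # unvisited nodes are tried first
        return value + c * math.sqrt(math.log(total) / visits)

    return max(nodes.items(), key=score)[0]
```

For example, calling `two_level_advantages` with an assumed reward such as `lambda p: 4 * p * (1 - p)` favors specifications of intermediate difficulty: a spec solved in half of its attempts receives a positive Proposer advantage, while specs that are always or never solved receive equal negative ones. When every attempt on a spec gets the same reward, all Solver advantages for that spec are zero, which is one way to see why sparse verifier feedback can starve the update of signal.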


Related papers