Fugu-MT Paper Translation (Abstract): ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Paper abstract: ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

arxiv url: http://arxiv.org/abs/2604.27644v2
Date: Thu, 07 May 2026 08:46:02 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.2836
Title: ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning
Authors: Chengcao Yang,
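Abstract summary: We propose a paradigm shift toward open-ended curriculum self-play. We introduce ANCORA, in which a policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions. We show that ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot). In a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.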
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question: generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions, anchored by three load-bearing mechanisms: a two-level group-relative update coupling Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT projecting the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play by 15.8 points despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
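The abstract names two generic building blocks, group-relative advantages and UCB-guided selection, without giving formulas. The following is a minimal sketch of the textbook versions of both, not the paper's implementation. It assumes GRPO-style standardization of rewards within a group and the standard UCB1 score; the function names, the `stats` layout, and the exploration constant `c` are hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group.

    Assumption: GRPO-style standardization. The abstract does not say
    how ANCORA normalizes or couples its two levels.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ucb1_select(stats, c=1.0):
    """Return the node id with the highest UCB1 score.

    `stats` (hypothetical layout) maps a node id to
    (visit_count, mean_reward). Unvisited nodes are chosen first.
    """
    total = sum(n for n, _ in stats.values())
    best_key, best_score = None, -math.inf
    for key, (n, avg) in stats.items():
        if n == 0:
            return key
        score = avg + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_key, best_score = key, score
    return best_key


# Solver level: one group holds the verifier outcomes (1 = verified,
# 0 = rejected) of several solution attempts for a single specification.
solver_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Proposer level: one group holds a per-specification reward across several
# proposed specifications (the reward values here are made up).
proposer_adv = group_relative_advantages([0.5, 0.25, 0.75])

# Curriculum: choose which node to expand next.
next_node = ucb1_select({"seed_a": (10, 0.9), "seed_b": (10, 0.1)})
```

Under this reading, the same normalization routine serves both levels and only the grouping changes: attempts per specification for the Solver, specifications per batch for the Proposer.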
