Fugu-MT paper translation (abstract): Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Paper abstract: Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

arxiv url: http://arxiv.org/abs/2605.14392v1
Date: Thu, 14 May 2026 05:14:45 GMT
Status: Translation complete
System update date: 2026-05-16 00:43:04.097893
Title: Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis
Authors: Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi,
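Abstract summary: We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds.
Reference score (independently computed attention score): 37.03537014865641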
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
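The abstract describes each environment as an executable object that samples instances, computes references, and scores responses, and it names planted subset-sum as a task that is hard to solve but easy to verify. The following is a minimal sketch of what such an object might look like, not the EvoEnv implementation. The class name `PlantedSubsetSumEnv`, the helpers `pass_rate` and `admit`, and the 0.1 to 0.9 admission band are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    numbers: tuple[int, ...]  # values shown to the solver
    target: int               # sum of the hidden planted subset


class PlantedSubsetSumEnv:
    """Hypothetical environment: sample instances, keep a reference, score responses."""

    def __init__(self, size: int = 12, max_value: int = 10_000):
        self.size = size
        self.max_value = max_value

    def sample(self, rng: random.Random):
        """Return (instance, reference); the reference is the planted index set."""
        numbers = tuple(rng.randint(1, self.max_value) for _ in range(self.size))
        k = rng.randint(2, self.size - 1)  # planted subset is always a proper subset
        planted = tuple(sorted(rng.sample(range(self.size), k)))
        target = sum(numbers[i] for i in planted)
        return Instance(numbers, target), planted

    def score(self, instance: Instance, response) -> float:
        """Cheap verification: distinct in-range indices whose values hit the target."""
        n = len(instance.numbers)
        valid = (
            isinstance(response, (list, tuple))
            and len(response) > 0
            and all(isinstance(i, int) and 0 <= i < n for i in response)
            and len(set(response)) == len(response)
        )
        if not valid:
            return 0.0
        total = sum(instance.numbers[i] for i in response)
        return 1.0 if total == instance.target else 0.0


def pass_rate(env, solver, trials: int = 50, seed: int = 0) -> float:
    """Average score of a solver callable over freshly sampled instances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        instance, _reference = env.sample(rng)
        total += env.score(instance, solver(instance))
    return total / trials


def admit(rate: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Assumed stand-in for solver-relative calibration: keep mid-difficulty envs."""
    return low <= rate <= high
```

Note that `score` accepts any subset that reaches the target, not only the planted one, so the verifier checks the answer itself and does not compare against the stored reference. An environment whose measured pass rate falls outside the assumed band would be rejected as too easy or too hard for the current solver.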
