Fugu-MT Paper Translation (Abstract): Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

Paper overview: Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

arxiv url: http://arxiv.org/abs/2603.02218v1
Date: Tue, 10 Feb 2026 08:12:09 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.080328
Title: Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
Title (reference translation): Self-play evolves only when the self-synthetic pipeline guarantees learnable information gain
Authors: Wei Liu, Siya Qi, Yali Du, Yulan He
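Abstract summary: Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops. Sustainable self-evolution requires a self-synthesised data pipeline whose learnable information increases across iterations.
Reference score (independently computed attention metric): 22.77669491242655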
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
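Abstract (reference translation): Large language models (LLMs) make it feasible to build systems that improve through self-evolving loops, yet many existing proposals are better described as self-play and tend to plateau quickly. The central failure mode is that the loop synthesises more data without increasing the information that the next iteration can learn from. Experiments on a self-play coding task show that sustained self-evolution requires a self-synthesised data pipeline whose learnable information grows across iterations. Self-evolving LLMs play three roles: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals. From the perspective of these three roles, the paper identifies three system designs that jointly target learnable information gain. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to keep pace with rising learnable information. Proactive information seeking introduces external context and new task sources to prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.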

Related papers