Fugu-MT Paper Translation (Summary): Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Paper summary: Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

arxiv url: http://arxiv.org/abs/2601.21464v1
Date: Thu, 29 Jan 2026 09:41:14 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:49.708051
Title: Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
Title (reference translation): Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
Authors: Yuan Sui, Bryan Hooi
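Abstract summary: CoNL is a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
Reference score (independently computed attention measure): 56.84819098277464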
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
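Abstract (reference translation): Training large language models (LLMs) for non-verifiable tasks such as creative writing, dialogue, and ethical reasoning remains difficult because there are no ground-truth labels. LLM-as-Judge approaches offer a scalable alternative to human feedback, but they face a fundamental limitation: performance is bounded by the quality of the evaluator itself. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (for example, preferring verbosity over quality) go unaddressed. This motivates meta-evaluation, the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight is that the quality of a critique can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy hold structured conversations in which they propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, which creates explicit supervision for meta-evaluation and allows generation and judging capabilities to be optimized jointly through self-play, without external judges or ground truth. In experiments on five benchmarks, CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.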
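
The abstract states that critiques which enable solution improvements earn a diagnostic reward, but it does not give the formula. The sketch below is only one plausible reading, not the authors' method: it assumes each solution receives numeric peer ratings before and after revision, and it credits the critic with the non-negative change in the mean rating. The `Turn` record, the rating lists, the clipping at zero, and both function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    """One propose -> critique -> revise exchange (hypothetical record layout)."""
    critic: str
    scores_before: list[float]  # assumed peer ratings of the original solution
    scores_after: list[float]   # assumed peer ratings of the revised solution


def diagnostic_reward(turn: Turn) -> float:
    """Credit a critique with the rating gain it enabled.

    Clipping at zero is an assumption of this sketch: a critique that makes
    the solution worse earns nothing rather than a penalty.
    """
    gain = mean(turn.scores_after) - mean(turn.scores_before)
    return max(0.0, gain)


def rewards_by_critic(turns: list[Turn]) -> dict[str, float]:
    """Sum diagnostic rewards per critic across a conversation."""
    totals: dict[str, float] = {}
    for turn in turns:
        totals[turn.critic] = totals.get(turn.critic, 0.0) + diagnostic_reward(turn)
    return totals


if __name__ == "__main__":
    conversation = [
        Turn("agent_a", [2.0, 4.0], [4.0, 6.0]),  # critique helped
        Turn("agent_b", [5.0, 5.0], [4.0, 4.0]),  # critique hurt
    ]
    print(rewards_by_critic(conversation))
```

In the setting the abstract describes, the ratings would themselves come from agents sharing the same policy rather than from an external judge; how the paper aggregates or normalizes them is not stated here.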
