Fugu-MT Paper Translation (Abstract): EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

Paper summary: EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

arxiv url: http://arxiv.org/abs/2605.03871v1
Date: Tue, 05 May 2026 15:31:00 GMT
Status: Translation complete
Last updated in system: 2026-05-06 19:35:44.007474
Title: EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
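Title (reference translation): EvoLM: Self-Evolving Language Models
Authors: Shuyue Stella Li, Rui Xin, Teng Xiao, Yike Wang, Rulin Shao, Zoey Hao, Melanie Sclar, Sewoong Oh, Faeze Brahman, Pang Wei Koh, Yulia Tsvetkov
Abstract summary: Language models encode substantial evaluative knowledge from pretraining. Current post-training methods rely on external supervision to produce reward signals. EVOLM is a method that structures a model's evaluative capacity into explicit discriminative rubrics.
Reference score (independently computed attention score): 86.49781345669436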
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling. Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal. EVOLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for discriminative utility, which maximizes a small frozen judge's ability to distinguish preferred from dispreferred responses; and (2) a policy trained using those rubric-conditioned scores as reward. All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM demonstrates that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.
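Abstract (reference translation): Language models encode substantial evaluative knowledge from pretraining, but current post-training methods depend on external supervision (human annotations, proprietary models, or scalar reward models) to generate reward signals. Each of these imposes a ceiling: human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains that have ground-truth answers. Self-improvement driven by a model's own evaluative capacity is a reward source that scales with the model itself, yet current methods leave it largely untapped. This paper introduces EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as the training signal. EVOLM alternately trains two capabilities within a single language model: (1) a rubric generator that produces instance-specific evaluation criteria optimized for discriminative utility, meaning they maximize the ability of a small frozen judge to distinguish preferred from dispreferred responses; and (2) a policy trained with those rubric-conditioned scores as reward. All preference signals are built from the policy's own outputs through temporal contrast with earlier checkpoints, so no human annotation or external supervision is required. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy reaches a 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM shows that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.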
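
The notion of "discriminative utility" in the abstract can be illustrated with a small sketch. It is not the authors' implementation: every name below (PreferencePair, temporal_contrast_pair, discriminative_utility, toy_judge) is hypothetical, the judge is a keyword-matching stand-in for the small frozen judge model, and treating the later checkpoint's output as the preferred response is an assumption that the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PreferencePair:
    """One preference example built without human labels (hypothetical type)."""
    prompt: str
    preferred: str      # assumed: output of the current checkpoint
    dispreferred: str   # assumed: output of an earlier checkpoint


# A judge maps (prompt, response, rubric) to a scalar score.
Judge = Callable[[str, str, str], float]


def temporal_contrast_pair(prompt: str, current_output: str,
                           earlier_output: str) -> PreferencePair:
    """Pair a current-checkpoint output against an earlier-checkpoint output."""
    return PreferencePair(prompt, current_output, earlier_output)


def toy_judge(prompt: str, response: str, rubric: str) -> float:
    """Stand-in judge: fraction of rubric lines found verbatim in the response.

    A real judge would be a frozen language model; this is only a placeholder
    so the example runs without any model.
    """
    criteria = [c.strip().lower() for c in rubric.splitlines() if c.strip()]
    if not criteria:
        return 0.0
    text = response.lower()
    return sum(c in text for c in criteria) / len(criteria)


def discriminative_utility(rubric: str, pairs: List[PreferencePair],
                           judge: Judge) -> float:
    """Fraction of pairs where the rubric-conditioned score of the preferred
    response is strictly higher than that of the dispreferred response."""
    if not pairs:
        return 0.0
    wins = sum(
        judge(p.prompt, p.preferred, rubric) > judge(p.prompt, p.dispreferred, rubric)
        for p in pairs
    )
    return wins / len(pairs)


pairs = [
    temporal_contrast_pair(
        "Explain recursion.",
        "Recursion needs a base case and a recursive step.",
        "Recursion is when a function repeats.",
    )
]
specific_rubric = "base case\nrecursive step"   # separates the two responses
generic_rubric = "recursion"                    # satisfied by both responses

utility_specific = discriminative_utility(specific_rubric, pairs, toy_judge)
utility_generic = discriminative_utility(generic_rubric, pairs, toy_judge)
```

In this toy setting the specific rubric would earn the rubric generator a higher reward than the generic one, because only the specific rubric lets the judge tell the two responses apart. The same rubric-conditioned score could then serve as the policy's reward, as in step (2) of the abstract.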

Related papers