Fugu-MT Paper Translation (Summary): Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

Paper summary: Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

arxiv url: http://arxiv.org/abs/2604.20051v1
Date: Tue, 21 Apr 2026 23:21:56 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:10.886066
Title: Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
Authors: Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie
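Abstract summary: Self-play has emerged as a promising paradigm for training Large Language Models (LLMs). POP is a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. On Qwen-2.5-7B, POP improves the performance of both pretrained and instruction-tuned models across different tasks.
Reference score (independently computed attention metric): 14.278605706996474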
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.
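The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable interface, the prompt tags (`PROPOSE_TASK`, `WRITE_RUBRIC`, `SOLVE`, `JUDGE`), the `self_play_step` function, and the `toy_llm` stand-in are all hypothetical. The choice to hide the passage from the solver and to score the reward as the fraction of rubric criteria judged satisfied are assumptions made for illustration; the abstract does not specify either. The RL update itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Example:
    task_input: str
    rubric: List[str]
    task_output: str
    reward: float


def self_play_step(llm: LLM, passage: str) -> Example:
    """One self-play example grounded on a pretraining passage (sketch only)."""
    # The same model proposes a task and writes a rubric, both with the passage.
    task_input = llm(f"PROPOSE_TASK\n{passage}")
    rubric_text = llm(f"WRITE_RUBRIC\n{passage}\n{task_input}")
    rubric = [line.strip() for line in rubric_text.splitlines() if line.strip()]

    # Assumption: the solver sees only the task input, not the passage.
    task_output = llm(f"SOLVE\n{task_input}")

    # Assumption: reward = fraction of rubric criteria the judge marks "yes".
    if not rubric:
        return Example(task_input, rubric, task_output, 0.0)
    passed = sum(
        llm(f"JUDGE\n{criterion}\n{task_output}").strip().lower().startswith("yes")
        for criterion in rubric
    )
    return Example(task_input, rubric, task_output, passed / len(rubric))


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to run the sketch."""
    kind = prompt.split("\n", 1)[0]
    if kind == "PROPOSE_TASK":
        return "Explain why hand washing reduces infections."
    if kind == "WRITE_RUBRIC":
        return "mentions germs\nmentions soap\nmentions water"
    if kind == "SOLVE":
        return "Soap removes germs from skin."
    if kind == "JUDGE":
        _, criterion, output = prompt.split("\n", 2)
        keyword = criterion.split()[-1]
        return "yes" if keyword in output.lower() else "no"
    return ""
```

In a real setup the returned reward would feed an RL trainer; here `self_play_step(toy_llm, passage)` simply returns the generated task, rubric, output, and score for inspection.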


Related papers