Fugu-MT Paper Translation (Abstract): Which Pairs to Compare for LLM Post-Training?

Paper abstract: Which Pairs to Compare for LLM Post-Training?

arxiv url: http://arxiv.org/abs/2606.19607v1
Date: Wed, 17 Jun 2026 21:19:01 GMT
Status: Translation complete
Last updated in system: 2026-06-19 18:23:39.547988
Title: Which Pairs to Compare for LLM Post-Training?
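Title (reference translation): Which Pairs Should Be Compared for LLM Post-Training?
Authors: Jiangze Han, Vineet Goyal, Will Ma
Abstract summary: This paper studies which pairs should be compared in preference-based post-training. It formulates comparison curation as a sampling-design problem and evaluates designs by the quality of the final policy. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
Reference score (independently computed attention score): 8.998543739618077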
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
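Abstract (reference translation): Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and to label the resulting comparison pairs. However, human preference labels are often far more expensive than generating additional completions, which suggests a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for DPO (Direct Preference Optimization) and analyze how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results are matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large pools of generated completions. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.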
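
The abstract does not give the form of the information matrix or the selection criterion, so the following is only an illustrative sketch of the general idea of choosing pairs through an information matrix, not the authors' method. It assumes a linear Bradley-Terry preference model over hypothetical completion feature vectors and swaps in a generic greedy D-optimal heuristic: each step picks the pair whose label would most increase the log-determinant of a ridge-regularized Fisher information matrix. The function name `greedy_pairs` and the parameters `features`, `budget`, `theta`, and `ridge` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def greedy_pairs(features, budget, theta=None, ridge=1.0):
    """Hypothetical helper: greedily choose `budget` distinct completion pairs.

    Assumes a linear Bradley-Terry model, where a label on pair (i, j) with
    feature difference z contributes p * (1 - p) * z z^T to the Fisher
    information, with p = sigmoid(theta . z) at a guessed parameter `theta`.
    """
    d = len(features[0])
    if theta is None:
        theta = [0.0] * d
    # Inverse of the starting information matrix ridge * I.
    inv = [[(1.0 / ridge if r == c else 0.0) for c in range(d)] for r in range(d)]
    candidates = set(combinations(range(len(features)), 2))
    chosen = []
    for _ in range(min(budget, len(candidates))):
        best = None
        for i, j in sorted(candidates):
            z = [a - b for a, b in zip(features[i], features[j])]
            p = sigmoid(sum(t * zi for t, zi in zip(theta, z)))
            w = p * (1.0 - p)
            u = mat_vec(inv, z)
            # Adding w z z^T raises log det by log(1 + w * z^T A^{-1} z).
            gain = w * sum(zi * ui for zi, ui in zip(z, u))
            if best is None or gain > best[0]:
                best = (gain, (i, j), w, u)
        gain, pair, w, u = best
        # Sherman-Morrison update of the inverse after A <- A + w z z^T.
        inv = [[inv[r][c] - w * u[r] * u[c] / (1.0 + gain) for c in range(d)]
               for r in range(d)]
        chosen.append(pair)
        candidates.remove(pair)
    return chosen


if __name__ == "__main__":
    # Four made-up completions embedded in two dimensions.
    pool = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
    print(greedy_pairs(pool, budget=2))
```

In this toy pool the first pick is the pair whose feature difference is largest, and near-identical completions are left unlabeled because comparing them adds little information. A real pipeline would need learned features and the criterion actually derived in the paper.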

Related papers