Fugu-MT Paper Translation (Abstract): On the Role of Difficult Prompts in Self-Play Preference Optimization

Paper abstract: On the Role of Difficult Prompts in Self-Play Preference Optimization

arxiv url: http://arxiv.org/abs/2510.05534v1
Date: Tue, 07 Oct 2025 02:47:25 GMT
Status: Translation complete
System update date: 2025-10-08 17:57:08.077338
Title: On the Role of Difficult Prompts in Self-Play Preference Optimization
Authors: Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing
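Abstract summary: This work investigates how prompts of varying difficulty influence self-play preference optimization. Difficult prompts are found to yield substantially inferior self-play optimization performance. The paper explores strategies to mitigate the effect of difficult prompts on final performance.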
Reference score (independently computed attention score): 62.030268525979274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically involves a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, which can be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component in this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of $N$ sampled responses of a prompt as a proxy for its difficulty. We find that difficult prompts exhibit substantially inferior self-play optimization performance in comparison to easy prompts for language models. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts closes as the model capacity increases, suggesting that difficulty interacts with the model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, while also reporting failed attempts and lessons learned.
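The difficulty proxy and the prompt-removal strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy rewards, the best-versus-worst pairing rule, and the `drop_fraction` value are all assumptions made for the example. The abstract only states that difficulty is the mean reward of $N$ sampled responses and that removing an appropriate portion of challenging prompts helps.

```python
from statistics import mean


def prompt_difficulty(rewards):
    """Proxy from the abstract: mean reward of the N sampled responses.

    A lower mean reward is read as a harder prompt.
    """
    return mean(rewards)


def build_pair(responses, rewards):
    """Assumed pairing rule: highest-reward response is chosen, lowest is rejected."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    return responses[best], responses[worst]


def drop_hardest(samples, drop_fraction):
    """Remove the drop_fraction of prompts with the lowest mean reward."""
    ranked = sorted(samples, key=lambda s: prompt_difficulty(s["rewards"]))
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


# Toy data: each prompt has N = 3 sampled responses with hypothetical RM scores.
samples = [
    {"prompt": "p1", "responses": ["a", "b", "c"], "rewards": [0.9, 0.7, 0.8]},
    {"prompt": "p2", "responses": ["d", "e", "f"], "rewards": [0.1, 0.3, 0.2]},
    {"prompt": "p3", "responses": ["g", "h", "i"], "rewards": [0.5, 0.6, 0.4]},
    {"prompt": "p4", "responses": ["j", "k", "l"], "rewards": [0.3, 0.9, 0.6]},
]

# Drop the hardest 25% of prompts, then build (prompt, chosen, rejected) triples
# that could be passed on to a DPO training step.
kept = drop_hardest(samples, drop_fraction=0.25)
pairs = [(s["prompt"], *build_pair(s["responses"], s["rewards"])) for s in kept]
```

In this toy run, `p2` has the lowest mean reward and is the one prompt removed. What counts as an "appropriate portion" is not specified in the abstract, so `drop_fraction` would need to be tuned.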

Related papers