Fugu-MT Paper Translation (Abstract): $p1$: Better Prompt Optimization with Fewer Prompts

Paper abstract: $p1$: Better Prompt Optimization with Fewer Prompts

arxiv url: http://arxiv.org/abs/2604.08801v1
Date: Thu, 09 Apr 2026 22:31:15 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.598214
Title: $p1$: Better Prompt Optimization with Fewer Prompts
Title (reference translation): $p1$: Improving Prompt Optimization with Fewer Prompts
Authors: Zhaolin Gao, Yu Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun
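Abstract summary: We show that prompt optimization succeeds when the variance among system prompts is sufficiently large, but fails when the variance among responses dominates the variance among system prompts. We propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts.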
Reference score (independently computed attention score): 49.20082664169319
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
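The two-component decomposition described in the abstract can be written, as a sketch, with the law of total variance. The notation here is an assumption and not taken from the paper: $s$ is a candidate system prompt, $y$ is a response sampled under $s$, and $r(s, y)$ is its reward (averaged over the user prompts in use).

```latex
\operatorname{Var}_{s,\,y}\bigl[r(s,y)\bigr]
  = \underbrace{\mathbb{E}_{s}\bigl[\operatorname{Var}_{y \mid s}[\,r(s,y)\,]\bigr]}_{\text{variance among responses}}
  + \underbrace{\operatorname{Var}_{s}\bigl[\mathbb{E}_{y \mid s}[\,r(s,y)\,]\bigr]}_{\text{variance among system prompts}}
```

Under this reading, the second term is the signal an optimizer can exploit when comparing system prompts, and the first term is sampling noise. When the first term dominates, differences between system prompts are hard to detect from a finite number of sampled responses.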
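Abstract (reference translation): Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness differs greatly from task to task. We examine what makes a task suitable for prompt optimization. We show that the reward variance across system prompts can be decomposed into two components: the variance among responses, which reflects the stochasticity of generation, and the variance among system prompts, which reflects differences in system prompt quality. Prompt optimization succeeds when the variance among system prompts is sufficiently large, but fails when the variance among responses dominates the variance among system prompts. Surprisingly, we also show that scaling to more user prompts can harm optimization by reducing the variance among system prompts, particularly on heterogeneous datasets in which different user prompts favor different system prompts. Based on this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts makes it possible to tell a good system prompt from a bad one, which makes optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization compared with training on the full dataset and outperforms strong baselines such as GEPA. In particular, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.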
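The filtering idea in the abstract (keep the user prompts whose reward varies most across candidate system prompts) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `mean_rewards[i][j]` layout, the choice of population variance, and the toy numbers are all hypothetical, and the abstract does not specify how the paper estimates variance or chooses the subset size.

```python
from statistics import fmean, pvariance


def variance_across_system_prompts(mean_rewards):
    """Per user prompt: variance of its reward across candidate system prompts.

    mean_rewards[i][j] is the reward of user prompt i under system prompt j,
    already averaged over sampled responses (assumed layout, not from the paper).
    """
    return [pvariance(row) for row in mean_rewards]


def select_high_variance_prompts(mean_rewards, k):
    """Return the indices of the k user prompts with the highest variance."""
    scores = variance_across_system_prompts(mean_rewards)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def system_prompt_variance(mean_rewards, subset):
    """Variance among system prompts after averaging over a subset of user prompts."""
    n_system = len(mean_rewards[0])
    per_system = [fmean(mean_rewards[i][j] for i in subset) for j in range(n_system)]
    return pvariance(per_system)


# Toy data invented for illustration: 4 user prompts x 3 candidate system prompts.
toy = [
    [0.9, 0.1, 0.5],  # separates the system prompts strongly
    [0.5, 0.5, 0.5],  # uninformative: every system prompt scores the same
    [0.4, 0.6, 0.5],  # mildly prefers a different system prompt than row 0
    [0.8, 0.2, 0.4],  # agrees with row 0
]

chosen = select_high_variance_prompts(toy, k=2)
full = system_prompt_variance(toy, range(len(toy)))
filtered = system_prompt_variance(toy, chosen)
print("selected user prompts:", chosen)
print("variance among system prompts (full set):", full)
print("variance among system prompts (filtered):", filtered)
```

On this toy data the two selected user prompts give a larger variance among system prompts than the full set does, which mirrors the abstract's claim that averaging over more, heterogeneous user prompts can wash out the differences between system prompts. A per-prompt variance score alone does not check whether the selected prompts agree on which system prompt is best, so a real pipeline may need more than this ranking.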

