Fugu-MT paper translation (abstract): Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Paper summary: Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

arxiv url: http://arxiv.org/abs/2511.20836v1
Date: Tue, 25 Nov 2025 20:37:59 GMT
Status: translation complete
Last updated in system: 2025-11-27 18:37:58.851984
Title: Structured Prompting Enables More Robust, Holistic Evaluation of Language Models
Title (reference translation): Structured prompting enables more robust and holistic evaluation of language models
Authors: Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Percy Liang, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari
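Abstract summary: Language models (LMs) are increasingly adopted across domains. High-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. This paper presents a DSPy+HELM framework that introduces structured prompting methods.
Reference score (independently computed attention score): 63.93860306068057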
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
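Abstract (reference translation): As language models (LMs) are adopted across more domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. Frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, but they often rely on fixed prompts that do not generalize across LMs, which yields unrepresentative performance estimates. Unless each LM's ceiling (the maximum achievable through changes to the prompt) is estimated, there is a risk of underestimating performance. Declarative prompting frameworks such as DSPy offer a scalable alternative to manual prompt engineering by creating structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated on established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs on seven benchmarks (general and medical domains) against existing HELM baseline scores. We find that, without structured prompting, (i) HELM underestimates LM performance (by 4% on average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), and (iii) performance gaps are misrepresented (leaderboard rankings flip on 3 of 7 benchmarks); we also find that (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, and it shows that scalable performance ceiling estimation makes benchmarks more useful for decisions. We open-source (i) the DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) the Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).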
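
The quantities the abstract reports (performance ceiling, average underestimation, spread across benchmarks, and ranking flips) can be illustrated with a small Python sketch. This is not the authors' pipeline: the model names, benchmark names, prompting-method names, and all scores below are hypothetical placeholders chosen only to show the arithmetic, and the ceiling is assumed to be the best score over the prompting methods tried.

```python
from statistics import mean, pstdev

# Hypothetical placeholder scores (NOT results from the paper):
# SCORES[model][benchmark][prompting_method] -> accuracy in [0, 1].
SCORES = {
    "model_a": {
        "bench_1": {"baseline": 0.70, "zero_shot_cot": 0.74, "optimized": 0.75},
        "bench_2": {"baseline": 0.60, "zero_shot_cot": 0.66, "optimized": 0.65},
    },
    "model_b": {
        "bench_1": {"baseline": 0.72, "zero_shot_cot": 0.73, "optimized": 0.73},
        "bench_2": {"baseline": 0.58, "zero_shot_cot": 0.61, "optimized": 0.63},
    },
}

BASELINE = "baseline"  # assumed label for the fixed-prompt score


def ceiling(per_prompt):
    """Best score over all prompting methods tried (baseline included)."""
    return max(per_prompt.values())


def summarize(scores, model):
    """Compare fixed-prompt scores with ceiling estimates for one model."""
    benches = scores[model]
    base = [benches[b][BASELINE] for b in benches]
    ceil = [ceiling(benches[b]) for b in benches]
    return {
        "mean_baseline": mean(base),
        "mean_ceiling": mean(ceil),
        # Average amount by which the fixed prompt underestimates the ceiling.
        "mean_gain": mean(c - b for c, b in zip(ceil, base)),
        # Population standard deviation across benchmarks.
        "std_baseline": pstdev(base),
        "std_ceiling": pstdev(ceil),
    }


def ranking(scores, benchmark, use_ceiling):
    """Models ordered from best to worst on one benchmark."""
    def score(model):
        per_prompt = scores[model][benchmark]
        return ceiling(per_prompt) if use_ceiling else per_prompt[BASELINE]
    return sorted(scores, key=score, reverse=True)


def flipped_benchmarks(scores):
    """Benchmarks where the ranking under baseline and ceiling scores differ."""
    benchmarks = next(iter(scores.values())).keys()
    return [b for b in benchmarks
            if ranking(scores, b, False) != ranking(scores, b, True)]


if __name__ == "__main__":
    for name in SCORES:
        print(name, summarize(SCORES, name))
    print("ranking flips on:", flipped_benchmarks(SCORES))
```

With these placeholder values, the two models swap places on `bench_1` once ceilings replace baseline scores, which is the kind of leaderboard flip the abstract describes.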

Related papers