Fugu-MT Paper Translation (Abstract): Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Paper summary: Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

arxiv url: http://arxiv.org/abs/2509.01790v1
Date: Mon, 01 Sep 2025 21:38:28 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.845113
Title: Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
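Title (reference translation): Flaw or Artifact: Rethinking Prompt Sensitivity in LLM Evaluation
Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
Abstract summary: High prompt sensitivity is widely accepted as a core limitation of large language models. Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? The authors find that much of the prompt sensitivity stems from evaluation methods such as log-likelihood scoring and rigid answer matching.
Reference score (independently computed attention score): 34.51801559719707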
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
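Abstract (reference translation): Prompt sensitivity refers to the phenomenon in which paraphrasing a prompt leads to significant changes in large language model (LLM) performance, and it has been widely accepted as a core limitation of LLMs. This paper revisits the issue and asks: is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, the authors systematically evaluate 7 LLMs (e.g., the GPT and Gemini families) on 6 benchmarks, including both multiple-choice and open-ended tasks, across 12 diverse prompt templates. They find that much of the prompt sensitivity stems from heuristic evaluation methods, such as log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings such as synonyms or paraphrases. With LLM-as-a-Judge evaluation, performance variance drops substantially and model rankings correlate consistently more strongly across prompts. These results suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.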
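
The abstract's central claim, that rigid answer matching can make a model look more prompt-sensitive than it is, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the template names, and the responses are invented, and `lenient_match` is a simple normalization-based stand-in, not the paper's LLM-as-a-Judge procedure.

```python
import re
from statistics import pstdev


def rigid_match(response: str, gold: str) -> bool:
    """Rigid answer matching: exact string equality after trimming whitespace."""
    return response.strip() == gold.strip()


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def lenient_match(response: str, gold: str) -> bool:
    """Hypothetical stand-in for a judge: the gold answer appears as whole words."""
    return f" {_normalize(gold)} " in f" {_normalize(response)} "


def accuracy_per_template(responses, gold, scorer):
    """Accuracy of each prompt template under the given scoring function."""
    return {
        template: sum(scorer(r, g) for r, g in zip(answers, gold)) / len(gold)
        for template, answers in responses.items()
    }


# Invented toy data: the same two questions answered under three prompt templates.
gold = ["Paris", "4"]
responses = {
    "template_a": ["Paris", "4"],
    "template_b": ["The answer is Paris.", "4"],
    "template_c": ["paris", "The result is 4"],
}

rigid_acc = accuracy_per_template(responses, gold, rigid_match)
lenient_acc = accuracy_per_template(responses, gold, lenient_match)

# Spread of accuracy across templates, used here as a crude sensitivity measure.
rigid_spread = pstdev(rigid_acc.values())
lenient_spread = pstdev(lenient_acc.values())

print("rigid:", rigid_acc, "spread:", round(rigid_spread, 3))
print("lenient:", lenient_acc, "spread:", round(lenient_spread, 3))
```

In this toy setup every response is semantically correct, so the accuracy spread across templates under `rigid_match` comes entirely from the scorer and not from the model. A real study would replace `lenient_match` with an actual judge model and use real benchmark data.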


Related papers