Fugu-MT Paper Translation (Abstract): Promptception: How Sensitive Are Large Multimodal Models to Prompts?

Paper abstract: Promptception: How Sensitive Are Large Multimodal Models to Prompts?

arxiv url: http://arxiv.org/abs/2509.03986v1
Date: Thu, 04 Sep 2025 08:13:06 GMT
Status: Translation complete
Last updated in system: 2025-09-05 20:21:10.097103
Title: Promptception: How Sensitive Are Large Multimodal Models to Prompts?
Authors: Mohamed Insaf Ismithdeen, Muhammad Uzair Khattak, Salman Khan,
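Abstract summary: Even subtle changes in prompt phrasing and structure can lead to accuracy deviations of up to 15%. The paper introduces Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. The findings show that proprietary models are more sensitive to prompt phrasing, while open-source models are steadier but struggle with nuanced and complex phrasing.
Reference score (independently computed attention measure): 18.456808203208425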
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
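The accuracy deviation mentioned in the abstract can be illustrated with a minimal sketch. This is not the Promptception code: the record layout, the function names `accuracy_by_prompt` and `deviation_from_baseline`, the prompt identifiers, and the toy data are all hypothetical, and the deviation is assumed here to be measured in percentage points relative to a chosen baseline prompt, which the abstract does not specify.

```python
from collections import defaultdict


def accuracy_by_prompt(records):
    """Compute MCQA accuracy per prompt variant.

    `records` is an iterable of (prompt_id, predicted_option, gold_option)
    tuples. This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for prompt_id, predicted, gold in records:
        total[prompt_id] += 1
        correct[prompt_id] += int(predicted == gold)
    return {p: correct[p] / total[p] for p in total}


def deviation_from_baseline(accuracy, baseline_id):
    """Return each variant's accuracy minus the baseline's, in percentage points."""
    base = accuracy[baseline_id]
    return {
        p: (acc - base) * 100.0
        for p, acc in accuracy.items()
        if p != baseline_id
    }


# Toy data: the same four questions answered under two hypothetical prompts.
records = [
    ("baseline", "A", "A"),
    ("baseline", "B", "B"),
    ("baseline", "C", "C"),
    ("baseline", "D", "A"),
    ("variant", "A", "A"),
    ("variant", "C", "B"),
    ("variant", "C", "C"),
    ("variant", "D", "A"),
]

accuracy = accuracy_by_prompt(records)
deviation = deviation_from_baseline(accuracy, "baseline")
print(accuracy)
print(deviation)
```

Running the same comparison for every prompt variant, model, and benchmark would give the kind of per-prompt sensitivity table the abstract alludes to, under the assumptions stated above.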
