Fugu-MT paper translation (abstract): Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls

Paper overview: Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls

arxiv url: http://arxiv.org/abs/2507.17467v1
Date: Wed, 23 Jul 2025 12:46:51 GMT
Status: translation complete
Last updated in system: 2025-07-24 22:33:14.991594
Title: Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls
Title (reference translation): Probing vision-language understanding through the visual entailment task: promises and pitfalls
Authors: Elena Pitta, Tom Kouwenhoven, Tessa Verhoef
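Abstract summary: This study examines the role the Visual Entailment (VE) task plays as a reliable probe of vision-language understanding in multimodal language models. Experiments are run in zero-shot, few-shot, and fine-tuning settings to examine how factors such as prompt design affect VE performance. Fine-tuning yields strong results, achieving 83.3% accuracy on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model.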
Reference score (independently computed attention score): 0.10923877073891446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we used explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the labels in the prompt is a critical factor that influences the predictions. In the absence of visual information, the model has a strong tendency to hallucinate and imagine content, raising questions about the model's over-reliance on linguistic priors. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Additionally, the explanation evaluation demonstrates that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1-score of 89.2%. We do, however, find comparable BERTScore results in experiments with limited vision, questioning the visual grounding of this task. Overall, our results highlight both the utility and limitations of VE as a diagnostic task for vision-language understanding and point to directions for refining multimodal evaluation methods.
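Abstract (reference translation): This study examines the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to understand what these results show about the underlying possibilities and limitations of the VE task. We run a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how prompt design, the number and order of in-context examples, and access to visual information affect VE performance. To further probe the model's reasoning process, we use explanation-based evaluations. The results show that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than benefit. In addition, the order of the labels in the prompt is a critical factor that influences predictions. Without visual information, the model tends to hallucinate and imagine content, raising questions about its over-reliance on linguistic priors. Fine-tuning yields strong results, achieving 83.3% accuracy on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Furthermore, the explanation evaluation shows that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1 of 89.2%. However, experiments with limited vision yield comparable BERTScore results, calling the visual grounding of this task into question. Overall, this work highlights both the utility and the limitations of VE as a diagnostic task for vision-language understanding and points to directions for refining multimodal evaluation methods.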
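
The abstract reports that the number of in-context examples and the order of the labels in the prompt both influence VE predictions. A minimal Python sketch of how such prompt variants could be enumerated and scored is given below. It is not the authors' code: the prompt wording, the function names (`build_ve_prompt`, `label_order_variants`, `accuracy`), and the text-only handling of examples are all hypothetical, and the image is assumed to be passed to the model separately. The three labels are the standard classes of the e-SNLI-VE dataset.

```python
from itertools import permutations

# The three visual entailment classes used in e-SNLI-VE.
LABELS = ("entailment", "neutral", "contradiction")


def build_ve_prompt(hypothesis, examples=(), label_order=LABELS):
    """Assemble the text part of a VE prompt (hypothetical wording).

    `examples` is a sequence of (hypothesis, label) pairs used as
    in-context demonstrations; an empty sequence gives a zero-shot prompt.
    The image itself is assumed to be supplied to the model separately.
    """
    options = ", ".join(label_order)
    lines = [f"Does the image support the hypothesis? Answer with one of: {options}."]
    for ex_hypothesis, ex_label in examples:
        lines.append(f"Hypothesis: {ex_hypothesis}")
        lines.append(f"Answer: {ex_label}")
    lines.append(f"Hypothesis: {hypothesis}")
    lines.append("Answer:")
    return "\n".join(lines)


def label_order_variants(hypothesis, examples=()):
    """Return one prompt per ordering of the three labels."""
    return {
        order: build_ve_prompt(hypothesis, examples, order)
        for order in permutations(LABELS)
    }


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Running the same evaluation set once per entry of `label_order_variants` and comparing the resulting `accuracy` values would be one way to measure sensitivity to label order under these assumptions.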


Related papers