Fugu-MT Paper Translation (Abstract): Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Paper abstract: Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

arxiv url: http://arxiv.org/abs/2604.21523v1
Date: Thu, 23 Apr 2026 10:36:50 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.447683
Title: Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
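Title (reference translation): Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
Abstract summary: We systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity.
Reference score (independently computed attention measure): 18.001586760420484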
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs, with failure rates exceeding 50% in some cases; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
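Abstract (reference translation): Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, in image-to-text (I2T) tasks such as visual question answering as well as in text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs has not yet been sufficiently examined. In this study, we systematically evaluate the reliability of Evaluator VLMs on both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for such quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs under single-answer scoring, pairwise comparison, and reference-guided paradigms. The results show that current VLM evaluators have substantial blind spots: they often fail to detect perturbed outputs, in some cases more than 50% of the time; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, although failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution when deploying them for benchmarking and development decisions. Code and data are publicly available.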

Related papers