Fugu-MT Paper Translation (Abstract): MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Paper overview: MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

arxiv url: http://arxiv.org/abs/2511.14159v1
Date: Tue, 18 Nov 2025 05:48:08 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:52.957307
Title: MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Title (reference translation): MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Authors: Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng
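Abstract summary: MVI-Bench is the first comprehensive benchmark for evaluating how misleading visual inputs undermine the robustness of Large Vision-Language Models (LVLMs). MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. MVI-Sensitivity is a novel metric that characterizes LVLM robustness at a granular level.
Reference score (independently computed attention score): 22.99984702966184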
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.
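Abstract (reference translation): Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, and largely overlook the equally critical challenge that misleading visual inputs pose when assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specifically designed to evaluate how misleading visual inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.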


Related papers