Related papers: MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

URL: http://arxiv.org/abs/2511.14159v1
Date: Tue, 18 Nov 2025 05:48:08 GMT
Title: MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Authors: Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng,
Abstract summary: We introduce MVI-Bench, the first comprehensive benchmark for evaluating how Misleading Visual Inputs undermine the robustness of Large Vision-Language Models (LVLMs)<n>MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship.<n>MVI-Sensitivity is a novel metric that characterizes LVLM robustness at a granular level.
Score: 22.99984702966184
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

Related papers

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm.<n>Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content.<n>We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z)
Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems.<n>We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z)
Event-Priori-Based Vision-Language Model for Efficient Visual Understanding [13.540340702321911]
Event-Priori-Based Vision-Language Model (EP-VLM) improves VLM inference efficiency.<n>EP-VLM uses motion priors derived from dynamic event vision to enhance VLM efficiency.
arXiv Detail & Related papers (2025-06-09T10:45:35Z)
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images.<n>Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.<n>We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.<n>Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions [73.97665608366447]
Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users.<n>These instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks.<n>We introduce B-AVIBench, a framework designed to analyze the robustness of LVLMs when facing various Black-box Adrial Visual-Instructions.
arXiv Detail & Related papers (2024-03-14T12:51:07Z)
Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept.<n>We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.