V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
- URL: http://arxiv.org/abs/2504.16727v2
- Date: Thu, 24 Apr 2025 02:18:01 GMT
- Title: V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
- Authors: Zhiyuan Fan, Yumeng Wang, Sandeep Polisetty, Yi R. Fung
- Abstract summary: V$^2$R-Bench is a benchmark framework for evaluating Visual Variation Robustness of LVLMs. We show that advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. These vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment.
- Score: 1.7971686967440696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision Language Models (LVLMs) excel in various vision-language tasks. Yet, their robustness to visual variations in position, scale, orientation, and context that objects in natural scenes inevitably exhibit due to changes in viewpoint and environment remains largely underexplored. To bridge this gap, we introduce V$^2$R-Bench, a comprehensive benchmark framework for evaluating Visual Variation Robustness of LVLMs, which encompasses automated evaluation dataset generation and principled metrics for thorough robustness assessment. Through extensive evaluation on 21 LVLMs, we reveal a surprising vulnerability to visual variations, in which even advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields, and demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we present a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. Results show that these vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural deficiencies, underscoring the need for architectural innovations in future LVLM designs.
Related papers
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation [38.20492321295552]
Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks.
Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated.
arXiv Detail & Related papers (2025-04-13T08:28:13Z) - V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
V-MAGE is a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs.
We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
arXiv Detail & Related papers (2025-04-08T15:43:01Z) - Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [10.792834356227118]
Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning.
Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities.
arXiv Detail & Related papers (2025-03-21T17:51:14Z) - iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.381263829108405]
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment.
We introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents.
arXiv Detail & Related papers (2025-02-05T14:29:01Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework using Bongard Problems (BPs) to dissect the perception-reasoning interface in Vision-Language Models (VLMs).
We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.
Our framework provides a valuable diagnostic tool, highlighting the need to enhance visual processing fidelity for achieving more robust and human-like visual intelligence in AI.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models [34.647839550142834]
We introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs.
Our evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks.
We show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks.
arXiv Detail & Related papers (2024-12-30T16:01:43Z) - Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.