The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs
- URL: http://arxiv.org/abs/2512.15949v1
- Date: Wed, 17 Dec 2025 20:22:23 GMT
- Title: The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs
- Authors: Tejas Anvekar, Fenil Bardoliya, Pavan K. Turaga, Chitta Baral, Vivek Gupta
- Abstract summary: We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals like face matching and text-in-vision comprehension capabilities. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations.
- Score: 44.71703930770065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.
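The abstract describes perturbing ground-truth inputs with pixel-based augmentations and checking how well a model's answers hold up. A minimal sketch of that general idea follows, and it is not the paper's implementation: the Gaussian-noise perturbation, the `robustness_report` helper, and the `toy_model` stand-in are all hypothetical names chosen for illustration, and the paper's actual augmentations, tasks, and metrics may differ.

```python
import random
from typing import Callable, Dict, List, Sequence

Image = List[List[float]]  # grayscale pixels in [0, 1]; an assumed toy format


def add_gaussian_noise(image: Image, sigma: float, seed: int = 0) -> Image:
    """Pixel-based perturbation: add zero-mean Gaussian noise, then clip to [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def accuracy(model: Callable[[Image], str],
             images: Sequence[Image],
             labels: Sequence[str]) -> float:
    """Fraction of images for which the model's answer matches the ground truth."""
    correct = sum(1 for img, y in zip(images, labels) if model(img) == y)
    return correct / len(labels)


def robustness_report(model: Callable[[Image], str],
                      images: Sequence[Image],
                      labels: Sequence[str],
                      sigmas: Sequence[float]) -> Dict[str, float]:
    """Accuracy on clean inputs and at each noise level (hypothetical helper)."""
    report = {"clean": accuracy(model, images, labels)}
    for s in sigmas:
        perturbed = [add_gaussian_noise(img, s, seed=i) for i, img in enumerate(images)]
        report[f"sigma={s}"] = accuracy(model, perturbed, labels)
    return report


def toy_model(image: Image) -> str:
    """Hypothetical stand-in for an MLLM: labels an image by its mean brightness."""
    flat = [p for row in image for p in row]
    return "bright" if sum(flat) / len(flat) > 0.5 else "dark"


images = [[[0.9] * 4 for _ in range(4)], [[0.1] * 4 for _ in range(4)]]
labels = ["bright", "dark"]
report = robustness_report(toy_model, images, labels, sigmas=[0.0, 0.05])
print(report)
```

Comparing the per-level accuracies against the clean score gives a simple view of how quickly performance degrades as the perturbation grows, which is the kind of signal a single leaderboard number hides.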
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment [7.969076042774561]
We introduce a low-level distortion perception task that requires models to classify specific distortion types. Our analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates. We show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%.
arXiv Detail & Related papers (2025-12-10T12:06:47Z)
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
- From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs [57.01486941224062]
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks. We focus on how models respond when identical key visual information is placed at different locations within an image. We introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens.
arXiv Detail & Related papers (2025-09-26T07:07:03Z)
- MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [50.33343842822694]
We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
arXiv Detail & Related papers (2025-05-26T18:20:22Z)
- Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward [1.7971686967440696]
V$2$R-Bench is a benchmark framework for evaluating Visual Variation Robustness of LVLMs. We show that advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. These vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment.
arXiv Detail & Related papers (2025-04-23T14:01:32Z)
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [13.768090541138571]
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text tokens. Tools uncover that vision tokens and system prompts dominate attention.
arXiv Detail & Related papers (2025-03-21T17:51:14Z)
- GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.