Same Answer, Different Representations: Hidden instability in VLMs
- URL: http://arxiv.org/abs/2602.06652v1
- Date: Fri, 06 Feb 2026 12:24:26 GMT
- Title: Same Answer, Different Representations: Hidden instability in VLMs
- Authors: Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan, Fazl Barez, Maria Sofia Bucarelli, Fabrizio Silvestri, Pasquale Minervini
- Abstract summary: We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
- Score: 65.36933543377346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
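The three internal metrics are straightforward to prototype. Below is a minimal PyTorch sketch of how embedding drift, an inter-image drift baseline, spectral sensitivity, and structural smoothness could be computed; the function names, the choice of cosine distance, and the log-amplitude spectrum are illustrative assumptions rather than the paper's exact definitions, and the embeddings and patch tokens are assumed to come from whatever VLM vision encoder is being probed.

```python
import torch
import torch.nn.functional as F

def embedding_drift(clean_emb: torch.Tensor, pert_emb: torch.Tensor) -> float:
    """Cosine distance between pooled representations of a clean input and
    its perturbed version (0 = unchanged internal representation)."""
    return 1.0 - F.cosine_similarity(clean_emb, pert_emb, dim=-1).item()

def inter_image_baseline(embs: torch.Tensor) -> float:
    """Mean pairwise cosine distance over embeddings (n, d) of unrelated
    images; drift approaching this value means the perturbed representation
    has moved into territory normally occupied by different inputs."""
    sims = F.cosine_similarity(embs.unsqueeze(0), embs.unsqueeze(1), dim=-1)
    n = embs.shape[0]
    return (1.0 - sims[~torch.eye(n, dtype=torch.bool)]).mean().item()

def spectral_sensitivity(clean_img: torch.Tensor, pert_img: torch.Tensor) -> float:
    """Mean absolute gap between log-amplitude spectra of the clean and
    perturbed image (H, W) -- a rough proxy for which spatial frequencies
    the perturbation disturbs."""
    spec_c = torch.fft.fft2(clean_img.float()).abs().log1p()
    spec_p = torch.fft.fft2(pert_img.float()).abs().log1p()
    return (spec_c - spec_p).abs().mean().item()

def structural_smoothness(vision_tokens: torch.Tensor, grid: int) -> float:
    """Spatial consistency of vision tokens (grid*grid, d): mean cosine
    similarity between horizontally and vertically adjacent patch tokens."""
    toks = vision_tokens.reshape(grid, grid, -1)
    horiz = F.cosine_similarity(toks[:, :-1], toks[:, 1:], dim=-1)
    vert = F.cosine_similarity(toks[:-1, :], toks[1:, :], dim=-1)
    return torch.cat([horiz.flatten(), vert.flatten()]).mean().item()
```

In practice the pooled embeddings and patch tokens would be extracted from the model's vision tower (e.g., via output_hidden_states=True in a HuggingFace-style forward pass), and drift is most informative when reported relative to the inter-image baseline computed on the same dataset.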
Related papers
- When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z)
- Questioning the Stability of Visual Question Answering [11.848401203578456]
Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations. We show that state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings (a minimal consistency-check sketch appears after this list).
arXiv Detail & Related papers (2025-11-14T12:05:05Z)
- RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation [67.38036090822982]
We propose RoboView-Bias, the first benchmark specifically designed to quantify visual bias in robotic manipulation. We create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
arXiv Detail & Related papers (2025-09-26T13:53:25Z)
- Evaluating Robustness of Vision-Language Models Under Noisy Conditions [0.0176290054713643]
Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. We present a comprehensive framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations.
arXiv Detail & Related papers (2025-09-15T22:31:21Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
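The steering mechanism described in the GrAInS entry above — shifting hidden activations at inference time while keeping their scale fixed — can be illustrated with a generic PyTorch forward hook. This is only a sketch of norm-preserving activation steering; GrAInS's gradient-based attribution for selecting tokens and constructing the steering vector is not reproduced here, and the layer path in the usage comment is a hypothetical example.

```python
import torch

def make_steering_hook(steer_vec: torch.Tensor, alpha: float = 4.0):
    """Returns a forward hook that nudges a layer's hidden states along
    `steer_vec`, then rescales every token vector back to its original
    norm so the intervention changes direction, not representational scale."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        orig_norm = hidden.norm(dim=-1, keepdim=True)
        steered = hidden + alpha * steer_vec.to(hidden.device, hidden.dtype)
        steered = steered * orig_norm / steered.norm(dim=-1, keepdim=True)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a decoder layer of a VLM's language model:
# handle = model.language_model.layers[k].register_forward_hook(
#     make_steering_hook(steer_vec))
# ...generate...
# handle.remove()
```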
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts [78.79936076607373]
We introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify the robustness of image classifiers under continuous and realistic nuisance shifts. We propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models.
arXiv Detail & Related papers (2025-07-23T16:15:48Z)
- Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z)
- Shaking to Reveal: Perturbation-Based Detection of LLM Hallucinations [25.18901449626428]
A widely adopted strategy to detect hallucination, known as self-assessment, relies on the model's own output confidence to estimate the factual accuracy of its answers. We propose Sample-Specific Prompting (SSP), a new framework that improves self-assessment by analyzing perturbation sensitivity at intermediate representations. SSP significantly outperforms prior methods across a range of hallucination detection benchmarks.
arXiv Detail & Related papers (2025-06-03T09:44:28Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Are vision language models robust to uncertain inputs? [5.249651874118556]
We show that newer and larger vision language models exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions. For natural image datasets such as ImageNet, this limitation can be overcome without pipeline modifications. We propose a novel mechanism based on caption diversity to reveal a model's internal uncertainty.
arXiv Detail & Related papers (2025-05-17T03:16:49Z)
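Several entries above — notably "Questioning the Stability of Visual Question Answering" and "Test-Time Consistency in Vision Language Models" — evaluate whether predictions survive small, meaning-preserving input changes. A minimal, model-agnostic sketch of such an output-consistency check is given below; `ask_vlm` is a hypothetical callable standing in for whatever model is under test, and the pixel shift is just one example of a benign perturbation.

```python
import numpy as np

def pixel_shift(img: np.ndarray, max_px: int = 3) -> np.ndarray:
    """Benign perturbation: shift the image by a few pixels (wrap-around)."""
    dy, dx = np.random.randint(-max_px, max_px + 1, size=2)
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

def consistency_rate(img, question, paraphrases, ask_vlm, n_shifts=5):
    """Fraction of perturbed (image, question) variants whose answer matches
    the answer on the clean input; `ask_vlm(image, text) -> str` wraps the
    model under test."""
    reference = ask_vlm(img, question)
    variants = [(pixel_shift(img), question) for _ in range(n_shifts)]
    variants += [(img, q) for q in paraphrases]
    answers = [ask_vlm(i, q) for i, q in variants]
    return sum(a == reference for a in answers) / len(answers)
```

A consistency rate well below 1.0 is the kind of output-level instability these benchmarks report; the representation-aware metrics sketched earlier can reveal drift even when this rate stays at 1.0.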
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.