Investigating Spatial Attention Bias in Vision-Language Models
- URL: http://arxiv.org/abs/2512.18231v1
- Date: Sat, 20 Dec 2025 06:22:38 GMT
- Title: Investigating Spatial Attention Bias in Vision-Language Models
- Authors: Aryan Chaudhary, Sanchit Goyal, Pratik Narang, Dhruv Kumar
- Abstract summary: This work identifies and characterizes a systematic spatial attention bias in Vision-Language Models (VLMs). We demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause.
- Score: 8.387055152856824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.
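As a rough illustration of the probe the abstract describes, the sketch below concatenates two images side by side and checks which object a generated caption mentions first. It is a minimal sketch under our own assumptions: the `describe` function standing in for a VLM call, the filenames, and the keyword matching are illustrative, not the authors' exact protocol.

```python
# Hypothetical left-first bias probe (illustrative, not the paper's code).
from PIL import Image

def concat_horizontal(left: Image.Image, right: Image.Image) -> Image.Image:
    """Paste two images side by side on a shared white canvas."""
    height = max(left.height, right.height)
    canvas = Image.new("RGB", (left.width + right.width, height), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    return canvas

def first_mentioned(caption: str, left_word: str, right_word: str) -> str:
    """Report which object's name occurs first in the generated caption."""
    text = caption.lower()
    i, j = text.find(left_word), text.find(right_word)
    if i == -1 and j == -1:
        return "neither"
    if j == -1 or (i != -1 and i < j):
        return "left"
    return "right"

# Usage, assuming `describe(image, prompt)` wraps some VLM:
# pair = concat_horizontal(Image.open("cat.jpg"), Image.open("dog.jpg"))
# print(first_mentioned(describe(pair, "Describe this image."), "cat", "dog"))
```

Tallying the "left"/"right" outcomes over many image pairs yields the kind of first-mention rate the abstract reports (roughly 97% left-first under neutral prompting).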
Related papers
- Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
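Read as Bayes' rule over actions a given visual observation o and instruction ℓ, the decomposition the summary describes would take the form below (our notation, inferred from the summary rather than taken from the paper):

```latex
p(a \mid o, \ell) \;\propto\;
\underbrace{p(\ell \mid a, o)}_{\text{language-conditioned likelihood}}
\,\cdot\,
\underbrace{p(a \mid o)}_{\text{visual-action prior}}
```

The prior captures which actions the scene affords (seeing-to-act), while the likelihood scores how well a candidate action matches the instruction (prompting-to-specify).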
arXiv Detail & Related papers (2025-12-12T01:59:23Z)
- Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis [19.111897718147656]
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data. We propose that the bias originates from the model's internal architecture.
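A generic way to quantify such a modality preference (a proxy measurement of our own, not the paper's key-space method) is to compare how much softmaxed attention mass flows to image keys versus text keys:

```python
import torch

def modality_attention_mass(attn: torch.Tensor,
                            image_key_mask: torch.Tensor) -> dict:
    """attn: (heads, query_len, key_len) softmaxed attention weights.
    image_key_mask: (key_len,) bool, True where the key is an image token.
    Returns the mean attention mass each modality receives per query."""
    image_mass = attn[..., image_key_mask].sum(-1).mean().item()
    text_mass = attn[..., ~image_key_mask].sum(-1).mean().item()
    return {"image": image_mass, "text": text_mass}
```

A pronounced text bias would show up as text mass far exceeding the text tokens' share of the sequence.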
arXiv Detail & Related papers (2025-10-30T17:22:22Z)
- From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs [57.01486941224062]
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks. We focus on how models respond when identical key visual information is placed at different locations within an image. We introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens.
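A minimal sketch of that position-assignment idea, assuming a flattened text+image token sequence (the mask layout and the choice to reuse the first image position are our assumptions, not necessarily BaPA's exact scheme):

```python
import torch

def balanced_position_ids(is_image_token: torch.Tensor) -> torch.Tensor:
    """is_image_token: (seq_len,) bool mask. Text tokens keep sequential
    position ids; every image token shares the id of the first image slot,
    so position embeddings cannot favor one image region over another."""
    pos = torch.arange(is_image_token.numel())
    if is_image_token.any():
        pos = pos.clone()
        pos[is_image_token] = int(torch.nonzero(is_image_token)[0])
    return pos

# e.g. 2 text + 4 image + 2 text tokens:
mask = torch.tensor([False, False, True, True, True, True, False, False])
print(balanced_position_ids(mask))  # tensor([0, 1, 2, 2, 2, 2, 6, 7])
```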
arXiv Detail & Related papers (2025-09-26T07:07:03Z)
- Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation [1.7997395646080083]
Large Vision Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet they also exhibit notable social biases. We propose an explanatory framework that combines information flow analysis with multi-round dialogue evaluation. Experiments reveal that LVLMs exhibit systematic disparities in information usage when processing images of different demographic groups.
arXiv Detail & Related papers (2025-05-27T12:28:44Z)
- Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [63.54377402784965]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN). Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners. Experiments on both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
- Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
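One training-free strategy in this spirit (a generic contrastive-decoding sketch of ours; the paper's two strategies may differ) is to contrast logits computed with and without the image, down-weighting tokens the LLM would emit from its priors alone:

```python
import torch

def debiased_logits(logits_with_image: torch.Tensor,
                    logits_text_only: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Amplify evidence the image adds and penalize the text-only prior.
    alpha controls how strongly the language prior is pushed down."""
    return (1 + alpha) * logits_with_image - alpha * logits_text_only
```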
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
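A minimal sketch of that projection step, assuming the biased directions have already been estimated (e.g., from embeddings of spurious-attribute prompts; how they are obtained and calibrated is the paper's contribution and is not shown here):

```python
import numpy as np

def project_out_bias(embeddings: np.ndarray, bias_dirs: np.ndarray) -> np.ndarray:
    """embeddings: (n, d) text embeddings; bias_dirs: (k, d) rows spanning
    the biased subspace (k <= d). Applies P = I - V V^T with V orthonormal."""
    v, _ = np.linalg.qr(bias_dirs.T)        # (d, k) orthonormal columns
    return embeddings - (embeddings @ v) @ v.T
```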
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
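One hedged reading of "test-time visual consistency" as an objective (our interpretation, not necessarily DAVIS's exact loss) is a divergence penalty between predictions on two augmented views of the same observation:

```python
import torch.nn.functional as F

def visual_consistency_loss(logits_view_a, logits_view_b):
    """KL divergence between action distributions from two augmentations."""
    log_p = F.log_softmax(logits_view_a, dim=-1)
    q = F.softmax(logits_view_b, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")
```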
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in functional space.
GGD can learn a more robust base model in both settings: task-specific biased models with prior knowledge, and a self-ensemble biased model without prior knowledge.
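In the spirit of that functional-space view, a simplified training step (our sketch, with placeholder losses and a frozen biased ensemble; not the paper's exact algorithm) adds the biased models' predictions to the base model's logits, so the base model only learns the signal the biased ensemble cannot already explain:

```python
import torch
import torch.nn.functional as F

def ggd_step(base_logits, biased_logits_list, targets):
    """Cross-entropy on combined logits; biased models are detached so
    gradients update only the base model."""
    biased_sum = torch.stack(biased_logits_list).sum(dim=0).detach()
    return F.cross_entropy(base_logits + biased_sum, targets)
```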
arXiv Detail & Related papers (2021-12-20T14:47:32Z)