Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
- URL: http://arxiv.org/abs/2503.13792v1
- Date: Tue, 18 Mar 2025 00:45:02 GMT
- Title: Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
- Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang
- Abstract summary: We show that Large Vision-Language Models (LVLMs) struggle to robustly utilize information across multiple images. We propose SoFt Attention (SoFA), a training-free approach that mitigates this position bias. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
- Score: 8.477985931416303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
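The mechanism described in the abstract lends itself to a short illustration. The sketch below blends the standard causal attention mask with a bidirectional mask, but only for attention between tokens belonging to different images. The function name, the `image_spans` argument, and the interpolation weight `lam` are illustrative assumptions, not the paper's actual API; the real SoFA implementation details are given in the paper itself.

```python
import torch

def soft_inter_image_bias(seq_len, image_spans, lam=0.5):
    """Illustrative sketch (not the paper's code): additive attention bias
    of shape (seq_len, seq_len).

    image_spans: list of (start, end) token-index ranges, one per image.
    lam: interpolation weight; lam=0 keeps plain causal attention,
         lam=1 makes inter-image attention fully bidirectional.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len))   # 1 = may attend
    bidirectional = torch.ones(seq_len, seq_len)

    # Start from the causal mask everywhere (text tokens stay causal).
    soft = causal.clone()

    # For query/key pairs drawn from two *different* images, linearly
    # interpolate between the causal and bidirectional masks.
    for i, (qs, qe) in enumerate(image_spans):
        for j, (ks, ke) in enumerate(image_spans):
            if i != j:
                soft[qs:qe, ks:ke] = (
                    (1 - lam) * causal[qs:qe, ks:ke]
                    + lam * bidirectional[qs:qe, ks:ke]
                )

    # Turn the [0, 1] mask into an additive bias for scaled dot-product
    # attention: 0 where attention is fully allowed, a large negative
    # penalty where it is blocked, and a softer penalty in between.
    return torch.log(soft.clamp_min(1e-9))

# Toy usage: two 8-token images at the start of a 32-token sequence.
bias = soft_inter_image_bias(32, image_spans=[(0, 8), (8, 16)], lam=0.5)
```

Because lam=0 recovers the unmodified causal mask, such an adjustment can be applied purely at inference time, which is consistent with the training-free claim in the abstract.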
Related papers
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.
TVC helps the model retain attention to the visual components throughout the reasoning.
Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when the model is confident (a toy sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration [22.39558434131574]
Large Vision-Language Models (LVLMs) generate responses that are not factually aligned with the visual content. We introduce a training-free solution, Uniform Attention Calibration (UAC), that estimates the bias from a single meaningless input image. We also introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), that enforces consistent outputs regardless of where the object is located in the image.
arXiv Detail & Related papers (2025-02-04T03:27:38Z) - Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective [13.486497323758226]
Vision-language models pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with objects or scenarios.
We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation.
arXiv Detail & Related papers (2024-07-03T05:19:45Z) - Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue of modern language models (LMs).
Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings.
By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z) - From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks [33.476693301050275]
We conduct experiments with truncation strategies across various LVLMs for visual question answering and image captioning tasks.
By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers.
arXiv Detail & Related papers (2024-06-04T13:52:54Z) - Unveiling the Tapestry of Consistency in Large Vision-Language Models [25.106467574467448]
We provide a benchmark ConBench to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point.
Based on the ConBench tool, we are the first to reveal this tapestry of consistency and derive several findings.
We hope this paper will accelerate the research community in better evaluating their models and encourage future advancements in the consistency domain.
arXiv Detail & Related papers (2024-05-23T04:08:23Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z) - CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
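The confidence-gated attention sharpening mentioned in the ADAPTVIS entry above can also be illustrated with a toy sketch. The threshold and temperature values below are made-up illustrative parameters, not the ones used in that paper; the idea is simply to lower the softmax temperature when the model is confident, so attention concentrates on the highest-scoring regions, and to raise it otherwise.

```python
import torch

def sharpen_attention(scores: torch.Tensor, confidence: float,
                      threshold: float = 0.6,
                      sharp_temp: float = 0.5,
                      smooth_temp: float = 2.0) -> torch.Tensor:
    """Illustrative sketch (not the paper's code).

    scores: (..., num_regions) pre-softmax attention logits over regions.
    confidence: scalar in [0, 1], e.g. the maximum probability the model
    currently assigns to its answer.
    """
    # Confident -> low temperature (peaky attention on top regions);
    # unsure -> high temperature (attention spread over more regions).
    temperature = sharp_temp if confidence >= threshold else smooth_temp
    return torch.softmax(scores / temperature, dim=-1)

# Toy usage: re-weight visual attention logits for 196 image patches.
logits = torch.randn(1, 8, 196)
attn = sharpen_attention(logits, confidence=0.8)
```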