Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict
- URL: http://arxiv.org/abs/2504.08974v1
- Date: Fri, 11 Apr 2025 20:56:52 GMT
- Title: Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict
- Authors: Pouya Pezeshkpour, Moin Aminnaseri, Estevam Hruschka
- Abstract summary: We examine how vision-language models (VLMs) reason when confronted with scenarios that present conflicting image and text cues. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information, and a task decomposition strategy that analyzes each modality separately before combining the results.
- Score: 3.9495397165148023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question.
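The headline numbers above (+56.8% image-favored to -74.4% text-favored) come from counting which modality a model's answer agrees with on deliberately mismatched image-text pairs. Below is a minimal Python sketch of how such a preference gap could be computed; the paper does not release this exact harness, so `query_vlm`, `matches`, and the `ConflictExample` fields are hypothetical placeholders.

```python
# Minimal sketch (not the authors' released code): scoring modality preference
# on mismatched image-text pairs. `query_vlm` and `matches` are hypothetical
# placeholders for a VLM call and a loose answer matcher.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConflictExample:
    image_path: str    # visual evidence, e.g. a chart showing "7 apples"
    text: str          # contradicting textual cue, e.g. "there are 5 apples"
    question: str      # query posed over both modalities
    image_answer: str  # expected answer if the model trusts the image
    text_answer: str   # expected answer if the model trusts the text


def preference_gap(
    examples: List[ConflictExample],
    query_vlm: Callable[[str, str, str], str],  # (image_path, text, question) -> answer
    matches: Callable[[str, str], bool],        # loose string/numeric answer matching
) -> float:
    """Return (% image-preferred) - (% text-preferred); positive values mean the
    model sides with the image, negative values mean it sides with the text."""
    image_pref = text_pref = 0
    for ex in examples:
        answer = query_vlm(ex.image_path, ex.text, ex.question)
        if matches(answer, ex.image_answer):
            image_pref += 1
        elif matches(answer, ex.text_answer):
            text_pref += 1
        # answers matching neither modality are left out of both counts
    return 100.0 * (image_pref - text_pref) / len(examples)
```

The task decomposition strategy from the abstract can be evaluated with the same harness by swapping `query_vlm` for a wrapper that first answers from the image alone, then from the text alone, and finally asks the model to reconcile the two answers.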
Related papers
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search [20.695290280579858]
This paper presents the HFUT-LMC team's solution to the WWW 2025 challenge on Text-based Person Anomaly Search (TPAS). The primary objective of this challenge is to accurately identify pedestrians exhibiting either normal or abnormal behavior within a large library of pedestrian images. We introduce the Similarity Coverage Analysis (SCA) strategy to address the recognition difficulty caused by similar text descriptions.
arXiv Detail & Related papers (2025-02-05T14:45:09Z)
- Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities [18.859309032300402]
We investigate how the integration of information from image and text modalities influences the performance and behavior of Visual Language Model (VLM) predictions.
We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task.
Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence.
arXiv Detail & Related papers (2024-10-02T16:02:02Z)
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective [44.045767657945895]
We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity.
We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions.
The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts.
arXiv Detail & Related papers (2024-07-21T18:08:44Z)
- Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective [13.486497323758226]
Vision-language models pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with objects or scenarios.
We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation.
arXiv Detail & Related papers (2024-07-03T05:19:45Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
- Enhancing Argument Structure Extraction with Efficient Leverage of Contextual Information [79.06082391992545]
We propose an Efficient Context-aware model (ECASE) that fully exploits contextual information.
We introduce a sequence-attention module and distance-weighted similarity loss to aggregate contextual information and argumentative information.
Our experiments on five datasets from various domains demonstrate that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-10-08T08:47:10Z)
- Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking [0.5242869847419834]
We propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy.
To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution.
We compare the performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets.
arXiv Detail & Related papers (2023-09-15T04:39:11Z)
- Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
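As a rough illustration of the kind of pipeline this last entry studies, the sketch below samples triples from a toy knowledge graph and verbalizes them into synthetic QA-style adaptation text. The sampling strategies, the template, and the toy data are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch (not the paper's code): sampling knowledge-graph triples
# and verbalizing them into synthetic training text, with the sample size and
# strategy as the knobs such a study varies.
import random
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def sample_triples(kg: List[Triple], size: int, strategy: str = "uniform") -> List[Triple]:
    if strategy == "uniform":
        return random.sample(kg, min(size, len(kg)))
    if strategy == "per_relation":
        # Balance the sample across relation types.
        by_rel: Dict[str, List[Triple]] = {}
        for t in kg:
            by_rel.setdefault(t[1], []).append(t)
        quota = max(1, size // max(1, len(by_rel)))
        picked: List[Triple] = []
        for triples in by_rel.values():
            picked.extend(random.sample(triples, min(quota, len(triples))))
        return picked[:size]
    raise ValueError(f"unknown strategy: {strategy}")


def verbalize(triple: Triple) -> str:
    head, relation, tail = triple
    # Simple single template; real systems use relation-specific templates.
    return f"Q: What is the {relation.replace('_', ' ')} of {head}? A: {tail}"


# Example: build a tiny synthetic corpus from a toy knowledge graph.
toy_kg = [("water", "state_at_room_temp", "liquid"), ("salt", "dissolves_in", "water")]
synthetic_corpus = [verbalize(t) for t in sample_triples(toy_kg, size=2)]
```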
This list is automatically generated from the titles and abstracts of the papers on this site.