Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
- URL: http://arxiv.org/abs/2509.03113v2
- Date: Wed, 01 Oct 2025 07:53:37 GMT
- Title: Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
- Authors: Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez
- Abstract summary: We propose a Gradient-based Influence-Aware Constrained Decoding (GACD) method to address text-visual bias and co-occurrence bias. GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
- Score: 49.26064449816502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models achieve strong performance across diverse tasks but remain prone to hallucinations, where outputs are not grounded in visual inputs. This issue can be attributed to two main biases: text-visual bias, the overreliance on prompts and prior outputs, and co-occurrence bias, spurious correlations between frequently paired objects. We propose Gradient-based Influence-Aware Constrained Decoding (GACD), an inference-based method that addresses both biases without auxiliary models and is readily applicable to existing models without finetuning. The core of our approach is bias estimation, which uses first-order Taylor gradients to understand the contribution of individual inputs (visual features and text tokens) to the current output. Based on this analysis, GACD mitigates hallucinations through two components: (1) suppressing spurious visual features correlated with the output objects, and (2) rebalancing cross-modal contributions by strengthening visual features relative to text. Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
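To make the mechanism concrete, below is a minimal sketch of one GACD-style decoding step, assuming a HuggingFace-style MLLM that accepts `inputs_embeds`. The spuriousness test and the `alpha`/`beta` factors are illustrative stand-ins, not the authors' released implementation.

```python
# Hedged sketch of GACD's core idea at a single decoding step.
# Assumptions: a HuggingFace-style model callable with `inputs_embeds`;
# `vis_slice` marks the visual-feature positions; `alpha`, `beta`, and the
# spuriousness test are illustrative stand-ins for the paper's criteria.
import torch

def token_influences(model, inputs_embeds, candidate_id):
    """Gradient-times-input (first-order Taylor) attribution of each
    input embedding to the logit of one candidate next token."""
    emb = inputs_embeds.detach().requires_grad_(True)
    logit = model(inputs_embeds=emb).logits[0, -1, candidate_id]
    (grad,) = torch.autograd.grad(logit, emb)
    return (grad * emb).sum(dim=-1).squeeze(0)  # shape: (seq_len,)

def gacd_like_step(model, inputs_embeds, vis_slice, candidate_id,
                   alpha=0.5, beta=1.5):
    infl = token_influences(model, inputs_embeds, candidate_id)
    adjusted = inputs_embeds.detach().clone()
    vis_infl = infl[vis_slice]
    # (1) Suppress visual features whose influence on the candidate object
    #     token is unusually high (a crude proxy for co-occurrence bias).
    spurious = vis_infl > vis_infl.mean() + vis_infl.std()
    adjusted[0, vis_slice][spurious] *= alpha
    # (2) Rebalance cross-modal contributions: strengthen all visual
    #     features relative to the text tokens.
    adjusted[0, vis_slice] *= beta
    with torch.no_grad():
        return model(inputs_embeds=adjusted).logits[0, -1]
```

The gradient-times-input score is the first-order Taylor term the abstract refers to; during constrained decoding it would be recomputed per candidate token.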
Related papers
- Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) remain prone to hallucinations, and effective solutions are still underexplored. We propose C3PO, a training-based mitigation framework comprising CoT Compression and Contrastive Preference Optimization.
arXiv Detail & Related papers (2026-02-03T11:00:55Z) - Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs [9.043999205886658]
Hallucinations in large vision-language models often arise when language priors dominate over visual evidence. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths. ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost.
arXiv Detail & Related papers (2026-01-20T08:04:18Z) - Semantic Item Graph Enhancement for Multimodal Recommendation [49.66272783945571]
Multimodal recommendation systems have attracted increasing attention for their improved performance by leveraging items' multimodal information. Prior methods often build modality-specific item-item semantic graphs from raw modality features. These semantic graphs suffer from semantic deficiencies, including insufficient modeling of collaborative signals among items.
arXiv Detail & Related papers (2025-08-08T09:20:50Z) - Mitigating Object Hallucinations via Sentence-Level Early Intervention [10.642552315531404]
Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations. We propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations and can reduce hallucinations by over 90% compared to the original model.
arXiv Detail & Related papers (2025-07-16T17:55:43Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models [28.24397677839652]
Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). We propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model's middle layers. VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.
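For intuition, a hedged sketch of the attention-amplification idea: add a bonus to the pre-softmax attention scores at the visual key positions in the targeted middle layers. The additive rule, `vis_slice`, and `delta` are assumptions; VAF's actual fusion rule may differ.

```python
# Hedged sketch: boost attention to visual tokens before softmax. This
# function would be spliced into the attention computation of the model's
# middle layers; the additive bonus `delta` is an illustrative stand-in.
import torch

def amplify_visual_attention(attn_scores, vis_slice, delta=1.0):
    """attn_scores: (batch, heads, q_len, k_len) pre-softmax scores."""
    scores = attn_scores.clone()
    # Additive boost on the visual key positions so every query attends
    # relatively more to image features after renormalization.
    scores[..., vis_slice] += delta
    return torch.softmax(scores, dim=-1)
```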
arXiv Detail & Related papers (2025-03-17T12:30:40Z) - Treble Counterfactual VLMs: A Causal Approach to Hallucination [6.3952983618258665]
Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. They often generate hallucinated outputs inconsistent with the visual context or prompt. Existing studies link hallucination to statistical biases, language priors, and biased feature learning, but lack a structured causal understanding.
arXiv Detail & Related papers (2025-03-08T11:13:05Z) - Dissecting Representation Misalignment in Contrastive Learning via Influence Function [15.28417468377201]
We introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models. Building upon ECIF, we develop a series of algorithms for data evaluation, misalignment detection, and misprediction trace-back tasks.
arXiv Detail & Related papers (2024-11-18T15:45:41Z) - Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Diffusion models operate over a sequence of timesteps rather than the instantaneous input-output relationships assumed in earlier influence-estimation settings.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss-gradient norms depend strongly on the timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest.
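A minimal sketch of the TracIn-style score and the timestep re-normalization it motivates; `loss_fn(model, sample, t)` stands in for a per-timestep diffusion loss, and the cosine-style normalization is one illustrative reading of ReTrac, not the paper's exact formula.

```python
# Hedged sketch of TracIn-style influence for diffusion models, with a
# re-normalization in the spirit of Diffusion-ReTrac. `loss_fn` is an
# assumed per-timestep diffusion loss; this is an illustration only.
import torch

def per_timestep_grad(model, loss_fn, sample, t):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, sample, t)
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                      for g, p in zip(grads, params)])

def influence(model, loss_fn, train_sample, test_sample, timesteps,
              renormalize=True):
    score = 0.0
    for t in timesteps:
        g_tr = per_timestep_grad(model, loss_fn, train_sample, t)
        g_te = per_timestep_grad(model, loss_fn, test_sample, t)
        dot = torch.dot(g_tr, g_te)
        if renormalize:
            # ReTrac-style fix: divide out the timestep-dependent gradient
            # norms so no single timestep dominates the score.
            dot = dot / (g_tr.norm() * g_te.norm() + 1e-8)
        score += dot.item()
    return score
```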
arXiv Detail & Related papers (2024-01-17T07:58:18Z) - Data augmentation and explainability for bias discovery and mitigation in deep learning [0.0]
This dissertation explores the impact of bias in deep neural networks and presents methods for reducing its influence on model performance.
The first part begins by categorizing and describing potential sources of bias and errors in data and models, with a particular focus on bias in machine learning pipelines.
The next chapter outlines a taxonomy and methods of Explainable AI as a way to justify predictions and control and improve the model.
arXiv Detail & Related papers (2023-08-18T11:02:27Z) - PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
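As a rough illustration of gradient alignment, the sketch below upweights the bias-conflicting group so its gradient magnitude matches the bias-aligned majority; the norm-matching rule and all names are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of the gradient-alignment idea: rescale the scarce
# bias-conflicting group's loss so both groups contribute comparably
# to the parameter update. Norm matching is an illustrative stand-in.
import torch

def balanced_loss(model, criterion, x_aligned, y_aligned,
                  x_conflict, y_conflict):
    loss_a = criterion(model(x_aligned), y_aligned)
    loss_c = criterion(model(x_conflict), y_conflict)
    params = [p for p in model.parameters() if p.requires_grad]
    # retain_graph=True so the combined loss can still be backpropagated.
    g_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    g_c = torch.autograd.grad(loss_c, params, retain_graph=True)
    norm_a = torch.cat([g.flatten() for g in g_a]).norm()
    norm_c = torch.cat([g.flatten() for g in g_c]).norm()
    # Upweight bias-conflicting samples so their gradient magnitude
    # matches the bias-aligned majority's.
    w = (norm_a / (norm_c + 1e-8)).detach()
    return loss_a + w * loss_c
```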
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
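As referenced in the "Debiasing Multimodal Large Language Models" entry above (and conceptually related to ACG's language-only attention path), here is a minimal sketch of training-free language-prior penalization: contrast next-token logits computed with the image against a text-only pass and subtract the prior. It assumes a HuggingFace-style VLM that can also run text-only (image placeholder tokens already removed in `text_only_ids`); `gamma` is an illustrative strength hyperparameter.

```python
# Hedged sketch of training-free language-prior penalization at decoding
# time. Assumes a HuggingFace-style VLM; `text_only_ids` is the prompt
# with image placeholder tokens stripped. `gamma` is illustrative.
import torch

@torch.no_grad()
def prior_penalized_logits(model, input_ids, pixel_values, text_only_ids,
                           gamma=1.0):
    with_img = model(input_ids=input_ids,
                     pixel_values=pixel_values).logits[0, -1]
    # Tokens the LLM favors without seeing the image carry the language
    # prior; subtracting them steers decoding toward grounded tokens.
    text_only = model(input_ids=text_only_ids).logits[0, -1]
    return (1 + gamma) * with_img - gamma * text_only
```

Setting `gamma=0` recovers standard decoding; larger values penalize tokens the text-only prior already favors.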
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.