MDSAM:Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation
- URL: http://arxiv.org/abs/2506.17664v1
- Date: Sat, 21 Jun 2025 09:49:16 GMT
- Title: MDSAM:Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation
- Authors: Shuaiye Lu, Linjiang Zhou, Xiaochuan Shi
- Abstract summary: Memory-Driven Sparse Attention Matrix (MDSAM) is a training-free approach that dynamically captures and refines the attention allocated to image tokens at each layer. MDSAM memorizes attention patterns and activates updates through alignment during decoding, enhancing focus on relevant image tokens while effectively reducing hallucinations.
- Score: 0.11704154007740833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations in large vision-language models (LVLMs) often stem from the model's sensitivity to image tokens during decoding, as evidenced by attention peaks observed when generating both real and hallucinated entities. To address this, we propose Memory-Driven Sparse Attention Matrix (MDSAM), a novel training-free approach that dynamically captures and refines the attention allocated to image tokens at each layer. MDSAM memorizes attention patterns and activates updates through alignment during decoding, enhancing focus on relevant image tokens while effectively reducing hallucinations. We evaluate MDSAM on multiple benchmarks for tasks such as image captioning and visual question answering, demonstrating its ability to consistently reduce hallucinations and improve reliability. Compatible with various LVLM architectures, MDSAM highlights its adaptability and effectiveness in mitigating hallucinations without requiring additional training or external tools.
Related papers
- Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation [51.743225614196774]
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning. They remain vulnerable to hallucination, where generated content deviates from visual evidence. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding. We propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs.
arXiv Detail & Related papers (2026-02-27T14:18:51Z)
- Context-Aware Decoding for Faithful Vision-Language Generation [5.258492912374723]
Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs). We probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy.
arXiv Detail & Related papers (2026-01-09T16:50:57Z)
- SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination [48.601385640941935]
We propose SAVE, a framework that mitigates hallucination by steering the model along Sparse Autoencoder latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks.
arXiv Detail & Related papers (2025-12-08T17:20:07Z)
- MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding [53.068815533016355]
We propose image head Masked Contrastive Decoding (MaskCD) for large vision-language models (LVLMs). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs.
arXiv Detail & Related papers (2025-10-03T07:59:16Z)
- SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision [59.61988843996952]
Style-Aware Visual Early Revision (SAVER) is a novel mechanism that dynamically adjusts LVLMs' final outputs based on token-level visual attention patterns. We show that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.
arXiv Detail & Related papers (2025-08-05T07:41:25Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z)
- TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection [6.006482486396196]
We propose Temporal Attention Real-time Accumulative Connection (TARAC) to mitigate hallucinations caused by the decay of attention on image tokens. We validate TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations.
arXiv Detail & Related papers (2025-04-05T07:57:11Z)
- Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation [123.54980913741828]
Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. We propose a novel, training-free strategy, namely Attention HIjackers Detection and Disentanglement (AID). AID identifies Attention Hijackers by calculating instruction-driven visual salience. Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers. Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects.
arXiv Detail & Related papers (2025-03-11T09:35:55Z)
- Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
- MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction [6.416957959150438]
Hallucinations hinder the application of Large Vision-Language Models (LVLMs) in domains that require high reliability. We propose MINT, a training-free decoding strategy, MItigating hallucinations via tokeN reducTion. Our approach achieves a 4% improvement in mitigating hallucinations caused by distracted perception compared to original models.
arXiv Detail & Related papers (2025-02-02T08:34:57Z)
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
- Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied with Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
arXiv Detail & Related papers (2024-10-21T11:54:53Z)
- Reducing Hallucinations in Vision-Language Models via Latent Space Steering [34.1755878632361]
Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications.
We introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features.
arXiv Detail & Related papers (2024-10-21T08:42:30Z)
- Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models [26.32657568461926]
Multimodal large language models (MLLMs) are prone to hallucinations. MemVR is a novel decoding paradigm inspired by common cognition. MemVR significantly mitigates hallucination across various MLLMs.
arXiv Detail & Related papers (2024-10-04T16:30:54Z)
- Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination [14.25488878224697]
We propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics.
Pensieve mitigates the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores.
arXiv Detail & Related papers (2024-03-21T13:49:42Z)