Related papers: LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

URL: http://arxiv.org/abs/2507.19110v1
Date: Fri, 25 Jul 2025 09:48:23 GMT
Title: LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models
Authors: Zhihui Guo, Xin Man, Hui Xu, Jie Shao
Abstract summary: Multimodal Large Language Models (MLLMs) excel in vision-language tasks but remain prone to object hallucinations. We propose LISA, which enhances generation consistency through hierarchical modulation and multi-layer fusion. Experiments show that LISA reduces hallucinations by up to 53.6% in $\mathrm{CHAIR}_I$ and improves POPE F1 by 4.5%.
Score: 8.122679857175315
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach that enhances generation consistency through hierarchical modulation and multi-layer fusion. LISA leverages the functional hierarchy within MLLMs, where shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, zone-specific spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\mathrm{CHAIR}_I$ and improves POPE F1 by 4.5%, demonstrating strong generalization across models and tasks.
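
For illustration only, the "token-wise anchor selection and soft logit fusion" step named in the abstract could look roughly like the Python sketch below. The abstract does not state how the anchor is chosen or how fusion weights are computed, so the lowest-entropy anchor rule, the KL-based weights, the `temperature` parameter, and every function name here are assumptions made for this sketch, not LISA's actual procedure.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl(p, q, eps=1e-12):
    # KL(p || q), with a small epsilon to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def fuse_token_logits(layer_logits, temperature=1.0):
    """Fuse per-layer logits for ONE token position.

    layer_logits: list of logit vectors, one per selected layer.
    Returns (anchor_index, fusion_weights, fused_logits).
    """
    probs = [softmax(l) for l in layer_logits]
    # ASSUMPTION: the anchor is the most confident (lowest-entropy) layer.
    anchor = min(range(len(probs)), key=lambda i: entropy(probs[i]))
    # ASSUMPTION: layers that agree with the anchor get larger weights.
    scores = [-kl(probs[anchor], p) / temperature for p in probs]
    weights = softmax(scores)
    vocab = len(layer_logits[0])
    # Soft fusion: weighted sum of the raw logits across layers.
    fused = [sum(w * l[v] for w, l in zip(weights, layer_logits))
             for v in range(vocab)]
    return anchor, weights, fused

# Toy example: three layers, vocabulary of three tokens (made-up numbers).
layer_logits = [
    [2.0, 1.0, 0.5],
    [4.0, 0.5, 0.1],
    [0.2, 0.3, 3.0],
]
anchor, weights, fused = fuse_token_logits(layer_logits)
print(anchor, [round(w, 3) for w in weights], fused.index(max(fused)))
```

In this toy setup the third layer disagrees with the anchor and is therefore down-weighted, which is one plausible way to read "suppression" of spurious deep-layer signals; the real method may differ.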

Related papers

From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs [59.78917775399492]
Multimodal instruction fine-tuning paradoxically degrades the text reasoning capability of MLLMs. We propose a training-free framework to mitigate this degradation.
arXiv Detail & Related papers (2026-01-12T15:27:51Z)
Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs [25.843085393058434]
TGIF (Text-Guided Inter-layer Fusion) is a lightweight module that treats encoder layers as depth-wise "experts". TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks.
arXiv Detail & Related papers (2026-01-06T15:31:19Z)
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z)
Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z)
D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs [15.665202830841046]
This work revisits hallucination detection from the perspective of model architecture and generation dynamics. We propose D$^2$HScore (Dispersion and Drift-based Hallucination Score), a training-free and label-free framework. Experiments across five open-source Language Models and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.
arXiv Detail & Related papers (2025-09-15T04:28:38Z)
Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning [5.85033069870214]
We propose an efficient vision-language fine-tuning method based on dynamic embedding and fusion of hierarchical visual features. By fine-tuning only a small number of parameters, DEHVF achieves precise alignment and complement of cross-modal information.
arXiv Detail & Related papers (2025-08-25T03:57:46Z)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way [7.18701660596182]
ByDeWay is a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP). It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model.
arXiv Detail & Related papers (2025-07-11T15:21:49Z)
Rethinking Visual Layer Selection in Multimodal LLMs [46.091556112958884]
This work proposes a Layer-wise Similarity approach to group CLIP-ViT layers with similar behaviors into shallow, middle, and deep categories. We revisit the visual layer selection problem in MLLMs at scale, training LLaVA-style models ranging from 1.4B to 7B parameters. We find that: (1) deep layers are essential for OCR tasks; (2) shallow and middle layers substantially outperform deep layers on reasoning tasks involving counting, positioning, and object localization; and (3) a lightweight fusion of features across shallow, middle, and deep layers consistently outperforms specialized fusion baselines and single-
arXiv Detail & Related papers (2025-04-30T09:07:10Z)
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order. In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy [33.85811169010525]
Large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. We propose LayAlign, a framework that integrates representations from all encoder layers.
arXiv Detail & Related papers (2025-02-17T03:45:03Z)
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer [110.39467860530819]
Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. We present LLaVA-UHD v2, an MLLM with advanced perception abilities enabled by a well-designed vision-language projector. The Hiwin transformer enhances the MLLM's ability to capture diverse multi-modal visual granularities by incorporating our constructed high-resolution semantic pyramid.
arXiv Detail & Related papers (2024-12-18T14:07:46Z)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment [39.870809905905325]
We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
arXiv Detail & Related papers (2024-10-08T11:41:55Z)
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers. Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.