HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models
- URL: http://arxiv.org/abs/2506.17587v1
- Date: Sat, 21 Jun 2025 04:56:55 GMT
- Title: HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models
- Authors: Le Yu, Kaishen Wang, Jianlong Xiong, Yue Cao, Tao He,
- Abstract summary: HalluRNN enhances model stability through recurrent cross-layer reasoning.<n>By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.
- Score: 11.826832299262199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though Large Vision-Language Models (LVLMs) have achieved remarkable performance across various tasks, they are still prone to hallucinations-generating outputs that are textually plausible but visually ungrounded. While prior approaches generally address this issue through data-centric fine-tuning or innovative decoding strategies, these methods often require substantial resources or task-specific configurations. In this work, we introduce an architecture-level solution, HalluRNN, which enhances model stability through recurrent cross-layer reasoning. Specifically, we propose a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. This allows for the adaptive propagation of information throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models [24.745502021162878]
Inverse problems (IPs) involve reconstructing signals from noisy observations.<n>DMs have emerged as a powerful framework for solving IPs, achieving remarkable reconstruction performance.<n>Existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence.<n>We propose two novel methods, DMILO and DMILO-PGD, to address these challenges.
arXiv Detail & Related papers (2025-05-27T06:49:02Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models.<n>We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components.<n>FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding [4.759109475818876]
Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains.<n>We introduce LIFT, a novel, high-performance framework that captures multiscale information through meta-learning.<n>We also introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings.
arXiv Detail & Related papers (2025-03-19T17:00:58Z) - Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [18.855378039713678]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies.<n>We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset.<n>We also introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
arXiv Detail & Related papers (2025-02-18T18:59:57Z) - EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling [8.250616459360684]
We introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models.<n>Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding benchmark, and 3D first-person ViZDoom environments.
arXiv Detail & Related papers (2025-02-01T15:49:59Z) - Towards Scalable and Deep Graph Neural Networks via Noise Masking [59.058558158296265]
Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks.<n> scaling them to large graphs is challenging due to the high computational and storage costs.<n>We present random walk with noise masking (RMask), a plug-and-play module compatible with the existing model-simplification works.
arXiv Detail & Related papers (2024-12-19T07:48:14Z) - SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery [54.866490321241905]
Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models.
In this paper, we examine the merged model's representation distribution and uncover a critical issue of "representation bias"
This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model.
arXiv Detail & Related papers (2024-10-18T11:49:40Z) - Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - Accurate and Lightweight Image Super-Resolution with Model-Guided Deep
Unfolding Network [63.69237156340457]
We present and advocate an explainable approach toward SISR named model-guided deep unfolding network (MoG-DUN)
MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations)
The superiority of the proposed MoG-DUN method to existing state-of-theart image methods including RCAN, SRDNF, and SRFBN is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
arXiv Detail & Related papers (2020-09-14T08:23:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.