Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
- URL: http://arxiv.org/abs/2511.10292v1
- Date: Fri, 14 Nov 2025 01:44:05 GMT
- Title: Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
- Authors: Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen,
- Abstract summary: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs.<n>Existing inference-time interventions to mitigate this issue present a challenging trade-off.<n>We present Residual-Update Directed DEcoding Regulation (RUDDER), a framework that steers LVLMs towards visually-grounded generation.
- Score: 13.32858759983739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos.<n>We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations.<n>Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings.<n>PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes.<n>Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations.<n>Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations.<n>To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive( VAR) models.<n>The proposed Reference Attention module discards the unnecessary attention from image$rightarrow$condition, reducing computational cost.<n>Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z) - Realism Control One-step Diffusion for Real-World Image Super-Resolution [21.13930153613271]
We propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR.<n>RCOD provides explicit control over fidelity-realism trade-offs during the noise prediction phase.<n>Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency.
arXiv Detail & Related papers (2025-09-12T10:32:04Z) - ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses.<n>They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications.<n>We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z) - Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration [8.192590936983347]
Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding.<n>They are frequently hampered by hallucination-the generation of text that contradicts visual input.<n>Existing training-free decoding strategies exhibit critical limitations.<n>This paper introduces Dynamic Logits (DLC), a novel training-free decoding framework designed to align text generation with visual evidence at inference time.
arXiv Detail & Related papers (2025-06-26T17:35:40Z) - Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning [23.851747078717473]
We introduce textbfValue-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity.<n>ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$times$ speedup.
arXiv Detail & Related papers (2025-06-18T17:23:36Z) - Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions.<n>We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
arXiv Detail & Related papers (2025-05-27T13:47:18Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.<n>It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL)<n>It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images.
textbfReflectance-aware textbfDiffusion with textbfDistilled textbfTrajectory (textbfReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE)
arXiv Detail & Related papers (2024-10-16T08:07:18Z) - Distortion-Disentangled Contrastive Learning [13.27998440853596]
We propose a novel POCL framework named Distortion-Disentangled Contrastive Learning (DDCL) and a Distortion-Disentangled Loss (DDL)
Our approach is the first to explicitly disentangle and exploit the DVR inside the model and feature stream to improve the overall representation utilization efficiency, robustness and representation ability.
arXiv Detail & Related papers (2023-03-09T06:33:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.