Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding
- URL: http://arxiv.org/abs/2506.09522v2
- Date: Sat, 11 Oct 2025 19:17:33 GMT
- Title: Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding
- Authors: Beomsik Cho, Jaehyung Kim,
- Abstract summary: Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding.<n>We propose ReVisiT, a training-free decoding method that references vision tokens to guide text generation.
- Score: 6.612630497074871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that references vision tokens to guide text generation. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization, and using its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT consistently enhances visual grounding with minimal computational overhead, and achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.
Related papers
- ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning [8.933549837045932]
Large Vision-Language Models incur high computational costs due to significant redundancy in their visual tokens.<n>We propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the Large Language Models.
arXiv Detail & Related papers (2026-01-25T12:47:30Z) - CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models [75.88232735646018]
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos.<n>Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations.<n>We propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM.
arXiv Detail & Related papers (2025-08-24T07:47:00Z) - ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses.<n>They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications.<n>We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics.<n>Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs)<n>We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations.<n>We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy.<n>We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z) - Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference [28.24397677839652]
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models.<n>How MLLMs process and utilize visual information remains unclear.<n>We propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance.
arXiv Detail & Related papers (2025-03-17T12:31:23Z) - D-Attn: Decomposed Attention for Large Vision-and-Language Models [29.611769371733672]
We propose Decomposed Attention (D-Attn), a more flexible attention architecture for large vision-and-language models (LVLMs)<n>D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions.<n>Experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks.
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.<n>This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)<n>SeTok groups visual features into semantic units via a dynamic clustering algorithm.<n>The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo)
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs)
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.