Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
- URL: http://arxiv.org/abs/2506.09040v1
- Date: Tue, 10 Jun 2025 17:57:50 GMT
- Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
- Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
- Abstract summary: We introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks.
- Score: 44.15671594378141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
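The core recipe in the abstract (standard next-token supervision on text plus autoregressive prediction of discrete semantic visual tokens, even when the inputs are continuous image features) can be illustrated with a short loss sketch. Names such as `visual_head`, `semantic_ids`, and the equal loss weighting are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of a dual autoregressive objective: text next-token
# prediction plus reconstruction of discrete semantic visual tokens.
import torch
import torch.nn.functional as F

def asvr_style_loss(llm_hidden, text_logits, text_labels,
                    visual_head, visual_positions, semantic_ids,
                    vis_weight=1.0):
    """llm_hidden: (B, L, D) hidden states from the LLM backbone.
    text_logits: (B, L, V_text) standard LM-head outputs.
    text_labels: (B, L) next-token targets, -100 where not supervised.
    visual_head: linear layer mapping D -> size of the semantic codebook.
    visual_positions: (B, L) bool mask marking visual-token positions.
    semantic_ids: (B, N_vis) discrete semantic-token targets per image."""
    # Usual autoregressive text loss.
    text_loss = F.cross_entropy(
        text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100)

    # Predict discrete semantic visual tokens from the hidden states at the
    # visual positions, even though the *inputs* were continuous features
    # (a real implementation would shift positions for next-token prediction).
    vis_hidden = llm_hidden[visual_positions]        # (B*N_vis, D)
    vis_logits = visual_head(vis_hidden)             # (B*N_vis, codebook)
    vis_loss = F.cross_entropy(vis_logits, semantic_ids.flatten())

    return text_loss + vis_weight * vis_loss
```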
Related papers
- Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics.
Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs).
We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
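A minimal sketch of the kind of training-free, intra-modal attention-based pruning VisionDrop describes, assuming access to the vision encoder's self-attention maps; the scoring rule (attention received, averaged over heads and queries) and `keep_ratio` are illustrative, not the paper's exact criterion.

```python
# Keep only the visual tokens that receive the most visual-to-visual attention.
import torch

def prune_visual_tokens(visual_tokens, attn_weights, keep_ratio=0.5):
    """visual_tokens: (B, N, D) patch embeddings from the vision encoder.
    attn_weights: (B, H, N, N) self-attention maps over the visual tokens."""
    # Average attention each token receives, across heads and query positions.
    scores = attn_weights.mean(dim=1).mean(dim=1)                 # (B, N)
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values     # keep spatial order
    idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return torch.gather(visual_tokens, 1, idx)                    # (B, k, D)
```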
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.
Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
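A heavily hedged sketch of how such a verifiable proxy task could be set up: perturb one word of a caption and reward the model only if it names the injected change. The word-swap table and exact-match reward below are invented for illustration and are not ViCrit's actual construction.

```python
# Inject a checkable "hallucination" into a caption and score the model's
# localization with a binary, automatically verifiable reward.
import random

SWAPS = {"red": "blue", "left": "right", "two": "three"}  # illustrative perturbations

def inject_hallucination(caption: str):
    """Swap one word to create a subtle, checkable hallucination.
    Returns (perturbed_caption, injected_word), or None if nothing applies."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not candidates:
        return None
    i = random.choice(candidates)
    words[i] = SWAPS[words[i].lower()]
    return " ".join(words), words[i]

def verifiable_reward(model_answer: str, injected_word: str) -> float:
    """1.0 if the model pinpoints the injected word, else 0.0."""
    return 1.0 if injected_word.lower() in model_answer.lower() else 0.0
```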
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models [93.46875303598577]
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals remains underexplored.
This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts.
arXiv Detail & Related papers (2025-04-02T10:47:07Z) - Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM).
Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model.
We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
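One plausible reading of semantic-guided selection, sketched under the assumption of a CLIP-style shared embedding space: score each encoded sub-image against the text query and keep only the most relevant crops. The cosine scoring and `top_k` cutoff are illustrative, not Semantic-Clipping's exact procedure.

```python
# Select the sub-images most relevant to the text query before feeding the LLM.
import torch
import torch.nn.functional as F

def select_relevant_crops(crop_embeds, text_embed, top_k=4):
    """crop_embeds: (N, D) embeddings of encoded sub-images (crops).
    text_embed: (D,) embedding of the user query.
    Returns the indices of the crops to keep, in original (spatial) order."""
    crop_embeds = F.normalize(crop_embeds, dim=-1)
    text_embed = F.normalize(text_embed, dim=0)
    scores = crop_embeds @ text_embed                      # (N,) cosine similarity
    keep = scores.topk(min(top_k, crop_embeds.size(0))).indices
    return keep.sort().values
```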
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.
We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs [23.011836329934255]
Vision Dynamic Embedding-Guided Pretraining (VDEP) is a hybrid autoregressive training paradigm for MLLMs.
The proposed method seamlessly integrates into standard models without architectural changes.
Experiments on 13 benchmarks show VDEP outperforms baselines, surpassing existing methods.
arXiv Detail & Related papers (2025-02-13T09:04:28Z) - Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models [29.611769371733672]
We propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently.
D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual embeddings without compromising performance.
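A sketch of what diagonalizing visual-to-visual self-attention can look like at the masking level, under a causal LLM layout; a real implementation would also skip the masked computations to realize the $\mathcal{O}(|V|)$ cost. The mask construction is an illustrative reading of the abstract, not the authors' formulation.

```python
# Build a causal attention mask where visual tokens do not attend to other
# visual tokens (only to themselves and to earlier text positions).
import torch

def diagonalized_attention_mask(seq_len, visual_mask):
    """visual_mask: (seq_len,) bool, True at visual-token positions.
    Returns a (seq_len, seq_len) bool mask where True = attention allowed."""
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal
    # Positions where both query and key are visual tokens.
    v2v = visual_mask.unsqueeze(1) & visual_mask.unsqueeze(0)
    # Forbid visual-to-visual attention except the diagonal (token to itself),
    # so the V-to-V block collapses from |V|^2 entries to its diagonal.
    allowed &= (~v2v) | torch.eye(seq_len, dtype=torch.bool)
    return allowed
```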
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
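A minimal vector-quantization sketch of mapping continuous patch features to discrete "visual words" over a learned codebook, in the spirit of the tokenizer described above; the codebook size and plain nearest-neighbor lookup are assumptions, and LaVIT's tokenizer additionally selects a dynamic number of tokens per image.

```python
# Quantize continuous patch features to discrete token ids via a codebook.
import torch
import torch.nn as nn

class SimpleVisualQuantizer(nn.Module):
    def __init__(self, codebook_size=16384, dim=1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_features):
        """patch_features: (B, N, D) continuous features from a vision encoder.
        Returns discrete token ids (B, N) and their quantized embeddings."""
        dists = torch.cdist(patch_features, self.codebook.weight[None])  # (B, N, K)
        ids = dists.argmin(dim=-1)                                       # (B, N)
        return ids, self.codebook(ids)
```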
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
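A hedged sketch of an image-text contrastive loss with soft pseudo-targets from a momentum (EMA) model, in the spirit of ALBEF's align-before-fuse objective with momentum distillation; the temperature, `alpha` mixing weight, and batch-wise one-hot targets are illustrative defaults rather than the paper's exact settings.

```python
# Image-text contrastive loss whose targets mix one-hot labels with soft
# pseudo-targets produced by slowly-updated momentum encoders.
import torch
import torch.nn.functional as F

def contrastive_with_momentum(img_feat, txt_feat, img_feat_m, txt_feat_m,
                              temp=0.07, alpha=0.4):
    """*_feat: (B, D) L2-normalized embeddings from the online encoders.
    *_feat_m: (B, D) embeddings from the momentum (EMA) encoders."""
    sim_i2t = img_feat @ txt_feat.t() / temp
    sim_t2i = txt_feat @ img_feat.t() / temp

    # Soft pseudo-targets from the momentum model, mixed with one-hot labels.
    with torch.no_grad():
        soft_i2t = F.softmax(img_feat_m @ txt_feat_m.t() / temp, dim=-1)
        soft_t2i = F.softmax(txt_feat_m @ img_feat_m.t() / temp, dim=-1)
        hard = torch.eye(img_feat.size(0), device=img_feat.device)
        tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard
        tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard

    loss_i2t = -(F.log_softmax(sim_i2t, dim=-1) * tgt_i2t).sum(-1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=-1) * tgt_t2i).sum(-1).mean()
    return (loss_i2t + loss_t2i) / 2
```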