Related papers: Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

URL: http://arxiv.org/abs/2509.19191v1
Date: Tue, 23 Sep 2025 16:07:18 GMT
Title: Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models
Authors: Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang,
Abstract summary: Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks.<n>These models typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision.<n>We introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency.
Score: 9.24989979549793
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.

Related papers

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
spatial reasoning advances vision-language models from visual perception toward semantic understanding.<n>We integrate the cognitive concept of an object-centric blueprint into spatial reasoning.<n>Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z)
Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs)<n>This paper presents a comprehensive analysis of attention patterns in efficient VLMs.<n>We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z)
Image Reconstruction as a Tool for Feature Analysis [2.0249250133493195]
We propose a novel approach for interpreting vision features via image reconstruction.<n>We show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks.<n>Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space.
arXiv Detail & Related papers (2025-06-09T14:32:18Z)
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.<n>This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
Thinking with Generated Images [30.28526622443551]
We present Thinking with Generated Images, a novel paradigm that transforms how large multimodal models (LMMs) engage with visual reasoning.<n>Our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking.
arXiv Detail & Related papers (2025-05-28T16:12:45Z)
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference [28.24397677839652]
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models.<n>How MLLMs process and utilize visual information remains unclear.<n>We propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance.
arXiv Detail & Related papers (2025-03-17T12:31:23Z)
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding [79.43306110124875]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.<n>Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs)<n>We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.<n>Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
What's in the Image? A Deep-Dive into the Vision of Vision Language Models [20.669971132114195]
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data.
arXiv Detail & Related papers (2024-11-26T14:59:06Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning. It embeds question awareness directly within the vision encoder. This integration results in dynamic visual features focusing on relevant image aspects to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z)
Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs. This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z)
ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations [77.3590853897664]
This work presents a self-supervised method to learn dense semantically rich visual embeddings for images inspired by methods for learning word embeddings in NLP.
arXiv Detail & Related papers (2021-11-24T12:27:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.