Related papers: VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

URL: http://arxiv.org/abs/2512.15649v2
Date: Tue, 23 Dec 2025 03:03:58 GMT
Title: VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang,
Abstract summary: Vision-text compression (VTC) converts long texts into dense 2D visual representations.<n>The impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated.<n>This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
Score: 43.88970987769102
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios.We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-processed information, failing to capture long associations or dependencies in the context.This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.

Related papers

VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning [55.17170420615628]
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks.<n>We propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process.<n>Our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency.
arXiv Detail & Related papers (2026-01-29T18:07:39Z)
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens [54.18057944158818]
Chain-of-Visual-Thought (COVT) is a framework that enables Vision-Language Models (VLMs) to reason through continuous visual tokens.<n>Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts.<n>During training, the VLM with COVT autoregressively predicts visual tokens to reconstruct dense supervision signals.
arXiv Detail & Related papers (2025-11-24T18:55:19Z)
Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding.<n>We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems.<n>We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z)
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input.<n>Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets.<n>We propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism.
arXiv Detail & Related papers (2025-04-09T12:51:10Z)
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing [30.94114120434789]
We propose KVTP (Keyframe-oriented Vision Token MME), a novel framework that overcomes the token pruning and selection drawbacks.<n> KVTP effectively retains essential contextual information while significantly reducing redundant computation.
arXiv Detail & Related papers (2025-03-13T17:47:52Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.