Glyph: Scaling Context Windows via Visual-Text Compression
- URL: http://arxiv.org/abs/2510.17800v2
- Date: Tue, 21 Oct 2025 17:12:48 GMT
- Title: Glyph: Scaling Context Windows via Visual-Text Compression
- Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
- Abstract summary: Glyph is a framework that renders long texts into images and processes them with vision-language models. Our method achieves 3-4x token compression while maintaining accuracy comparable to leading long-context models. Under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks.
- Score: 91.20717058018745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
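The abstract's core mechanism, a genetic search over text-to-image rendering configurations that trades token compression against legibility, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the patch size, page dimensions, character-width heuristic, and legibility penalty (a stand-in for the accuracy signal the paper obtains from an LLM-driven evaluation) are all assumptions.

```python
import random

PATCH = 28                    # assumed ViT patch size (pixels per visual-token side)
PAGE_W, PAGE_H = 896, 1344    # assumed rendered page size in pixels

def visual_tokens(cfg, n_chars):
    """Estimate visual tokens needed to render n_chars under a rendering config."""
    chars_per_line = PAGE_W // int(cfg["font_px"] * 0.6)   # rough average glyph width
    lines_per_page = PAGE_H // int(cfg["font_px"] * cfg["line_spacing"])
    chars_per_page = chars_per_line * lines_per_page
    pages = -(-n_chars // chars_per_page)                  # ceiling division
    tokens_per_page = (PAGE_W // PATCH) * (PAGE_H // PATCH)
    return pages * tokens_per_page

def fitness(cfg, n_chars, n_text_tokens):
    """Higher is better: compression ratio minus a penalty for tiny fonts
    (a crude proxy for the accuracy term in the paper's search objective)."""
    compression = n_text_tokens / visual_tokens(cfg, n_chars)
    legibility_penalty = max(0.0, (12 - cfg["font_px"]) * 0.2)
    return compression - legibility_penalty

def mutate(cfg):
    child = dict(cfg)
    child["font_px"] = max(8, cfg["font_px"] + random.choice([-2, -1, 1, 2]))
    child["line_spacing"] = min(2.0, max(1.0, cfg["line_spacing"] + random.choice([-0.1, 0.1])))
    return child

def genetic_search(n_chars=400_000, n_text_tokens=100_000,
                   pop_size=8, generations=20, seed=0):
    random.seed(seed)
    pop = [{"font_px": random.randint(10, 24),
            "line_spacing": random.choice([1.0, 1.2, 1.5])}
           for _ in range(pop_size)]
    for _ in range(generations):
        # keep the fitter half, refill with mutated copies of survivors
        pop.sort(key=lambda c: fitness(c, n_chars, n_text_tokens), reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=lambda c: fitness(c, n_chars, n_text_tokens))

best = genetic_search()
```

In the paper the fitness evaluation involves rendering real documents and scoring a VLM's task accuracy; here it is collapsed into a closed-form estimate so the search loop itself stays visible.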
Related papers
- What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models [9.530137749236617]
Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models. We introduce a two-fold analytical framework featuring a novel probing tool, LenEmbeds, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories.
arXiv Detail & Related papers (2026-02-28T07:13:36Z) - Global Context Compression with Interleaved Vision-Text Transformation [12.971394377165767]
In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. We propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding. With a 4x compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks.
arXiv Detail & Related papers (2026-01-15T13:29:16Z) - SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference [49.84148668264725]
We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0x faster prefilling, 2.5x faster decoding, and an overall 2.6x end-to-end speedup on long-context video tasks.
arXiv Detail & Related papers (2025-10-20T17:35:47Z) - QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA [16.494799458292]
Images often contain more redundant information than text, and not all visual details are pertinent to specific questions. We propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space.
arXiv Detail & Related papers (2025-04-01T11:07:19Z) - Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3x fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z) - Visual Context Window Extension: A New Perspective for Long Video Understanding [45.134271969594614]
We tackle the challenge of long video understanding from the perspective of context windows.
We propose to adapt LMMs for long video understanding tasks by extending the visual context window.
Our method consistently improves the performance as the number of video frames increases.
arXiv Detail & Related papers (2024-09-30T07:25:16Z) - Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) typically train on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
arXiv Detail & Related papers (2023-08-30T16:47:51Z) - In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.