Global Context Compression with Interleaved Vision-Text Transformation
- URL: http://arxiv.org/abs/2601.10378v2
- Date: Sat, 17 Jan 2026 02:11:12 GMT
- Title: Global Context Compression with Interleaved Vision-Text Transformation
- Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang,
- Abstract summary: In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. We propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks.
- Score: 12.971394377165767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.
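The token saving described in the abstract can be illustrated with a toy cost model. The sketch below is only an idealized illustration, not the authors' method or measurement: the function names and the 32,768-token context length are hypothetical, and the only figure taken from the abstract is the 4x compression ratio. It assumes every text token in the pre-context is replaced by visual tokens at that ratio, and it ignores the vision encoder, the text chunk currently being decoded, and all non-attention FLOPs, so it does not reproduce the reported speedup, memory, or FLOPs numbers.

```python
import math


def compressed_context_tokens(text_tokens: int, ratio: float) -> int:
    """Tokens left in the context after compressing text_tokens by ratio."""
    if text_tokens < 0 or ratio <= 0:
        raise ValueError("text_tokens must be >= 0 and ratio must be > 0")
    return math.ceil(text_tokens / ratio)


def causal_attention_pairs(context_tokens: int) -> int:
    """Query-key pairs scored by causal self-attention over the context."""
    return context_tokens * (context_tokens + 1) // 2


if __name__ == "__main__":
    text_tokens = 32_768  # hypothetical context length, not from the paper
    ratio = 4             # compression ratio quoted in the abstract
    visual_tokens = compressed_context_tokens(text_tokens, ratio)
    saving = causal_attention_pairs(text_tokens) / causal_attention_pairs(visual_tokens)
    print(f"context tokens: {text_tokens} -> {visual_tokens}")
    print(f"attention pairs reduced by about {saving:.1f}x")
```

Under these assumptions the number of cached context tokens shrinks linearly with the compression ratio, while the number of attention pairs shrinks roughly with its square, which is why compressing the whole context (and not only the prefill) matters for long generations.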
Related papers
- Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs [14.784763071210014]
We show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs.
We exploit the idea of rendering long text inputs as a single image and provide it directly to the model.
This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression.
arXiv Detail & Related papers (2025-10-21T04:07:20Z)
- Glyph: Scaling Context Windows via Visual-Text Compression [91.20717058018745]
Glyph is a framework that renders long texts into images and processes them with vision-language models.
Our method achieves 3-4x token compression while maintaining accuracy comparable to leading long-context models.
Under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks.
arXiv Detail & Related papers (2025-10-20T17:58:56Z)
- Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics.
Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs).
We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z)
- VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework.
It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model.
VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z)
- Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck [40.21228703978429]
We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner.
Fwd2Bot results in highly-informative compressed representations suitable for both generative and discriminative tasks.
arXiv Detail & Related papers (2025-03-27T17:57:07Z)
- Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading.
On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z)
- SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization [20.109136454526233]
We propose SweetTok, a novel video tokenizer to overcome the limitations in current video tokenization methods.
SweetTok compresses visual inputs through distinct spatial and temporal queries via Decoupled AutoEncoder (DQAE).
We show that SweetTok significantly improves video reconstruction results by 42.8% w.r.t. rFVD on the UCF-101 dataset.
arXiv Detail & Related papers (2024-12-11T13:48:06Z)
- Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.
We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Efficient Large Multi-modal Models via Visual Context Compression [23.966237939194514]
We present the study on the analysis of redundancy concerning visual tokens and efficient training within large language models.
Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy.
We introduce Visual Context on the GQA benchmark, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-06-28T17:57:14Z)
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.