Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
- URL: http://arxiv.org/abs/2510.18279v2
- Date: Wed, 22 Oct 2025 01:54:03 GMT
- Title: Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
- Authors: Yanhong Li, Zixuan Lan, Jiawei Zhou
- Abstract summary: We show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression.
- Score: 14.784763071210014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
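A minimal sketch of the text-as-image idea, assuming Pillow for rendering and a ViT-style patch encoder; the wrapping math, default font, and 14px patch size are illustrative assumptions, not the paper's exact pipeline:

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_as_image(text: str, width: int = 1024,
                         font_size: int = 16) -> Image.Image:
    """Render a long text as one image, one wrapped line per row."""
    font = ImageFont.load_default()               # stand-in for a real TTF font
    chars_per_line = max(1, width // (font_size // 2))
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    line_h = font_size + 4
    img = Image.new("RGB", (width, line_h * len(lines) + 8), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, 4 + i * line_h), line, fill="black", font=font)
    return img

def visual_token_count(img: Image.Image, patch: int = 14) -> int:
    """ViT-style encoders spend roughly one token per image patch."""
    w, h = img.size
    return (w // patch) * (h // patch)

doc = "The quick brown fox jumps over the lazy dog. " * 200
img = render_text_as_image(doc)
# Whitespace split is a crude proxy for text-token count; actual savings
# depend on font density and encoder-side patch pooling, which these
# untuned defaults do not model.
print(len(doc.split()), "text words vs.", visual_token_count(img), "patches")
```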
Related papers
- Global Context Compression with Interleaved Vision-Text Transformation [12.971394377165767]
In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. We propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks.
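A heavily hedged sketch of the interleaving idea, with `visual_encode` as a hypothetical image encoder returning compressed tokens; VIST2's actual architecture and training are not shown here:

```python
from typing import Callable, List, Sequence

def compress_context(chunks: Sequence[List[int]],
                     visual_encode: Callable[[List[int], int], List[int]],
                     ratio: int = 4) -> List[int]:
    """Swap older text chunks for compact visual encodings (roughly a
    `ratio`x saving on the prefix) while keeping the most recent chunk
    as raw tokens, preserving local text-visual alignment."""
    seq: List[int] = []
    for c in chunks[:-1]:
        seq += visual_encode(c, max(1, len(c) // ratio))  # compressed stand-in
    seq += list(chunks[-1])                               # live chunk verbatim
    return seq
```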
arXiv Detail & Related papers (2026-01-15T13:29:16Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
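A small sketch of Matryoshka-style granularity, pooling the same visual token grid to coarser budgets; plain NumPy average pooling stands in for MME's learned compression:

```python
import numpy as np

def pool_tokens(tokens: np.ndarray, factor: int) -> np.ndarray:
    """tokens: (n, d) visual tokens; average every `factor` neighbours
    so the same grid can be read at several token budgets."""
    n, d = tokens.shape
    n_keep = n // factor
    return tokens[: n_keep * factor].reshape(n_keep, factor, d).mean(axis=1)

grid = np.random.randn(256, 768)                    # e.g. 16x16 patches, d=768
coarse = [pool_tokens(grid, f) for f in (1, 4, 16)]  # 256, 64, 16 tokens
```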
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image conditioned on the text query, using an auxiliary model such as CLIP.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
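A rough sketch of the auxiliary-model step using Hugging Face CLIP: patch embeddings are projected into the joint space and scored against the query to form a heatmap. Projecting patch tokens this way (and skipping CLIP's post-layernorm) is an approximation, not the paper's exact recipe:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def query_heatmap(image: Image.Image, query: str) -> torch.Tensor:
    inputs = proc(text=[query], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        # Drop the CLS token, project patches into the joint space.
        patches = model.visual_projection(vision_out.last_hidden_state[:, 1:])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
    patches = patches / patches.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sim = (patches @ text_emb.T).squeeze(-1).squeeze(0)  # one score per patch
    side = int(sim.numel() ** 0.5)                       # 7x7 for ViT-B/32 @ 224
    return sim.reshape(side, side)                       # overlayable heatmap
```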
arXiv Detail & Related papers (2024-09-25T17:59:13Z) - Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks. The main computational burden arises from processing text and visual tokens. We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
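A hedged sketch of inflection-point pruning: rank visual tokens by CLS similarity and cut at the sharpest bend of the sorted curve, here proxied by the largest second difference; the paper's exact criterion may differ:

```python
import numpy as np

def prune_by_inflection(tokens: np.ndarray, cls: np.ndarray) -> np.ndarray:
    """tokens: (n, d) visual tokens (n >= 3 assumed); cls: (d,) CLS token."""
    sim = tokens @ cls / (np.linalg.norm(tokens, axis=1) * np.linalg.norm(cls))
    order = np.argsort(-sim)                 # most CLS-relevant first
    curve = sim[order]                       # sorted, decreasing curve
    # Largest second difference = steepest change of slope, a knee proxy.
    knee = int(np.argmax(np.diff(curve, n=2))) + 1
    return tokens[order[: max(knee, 1)]]     # keep the head, prune the tail
```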
arXiv Detail & Related papers (2024-09-02T10:49:10Z) - AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008\times 1008$.
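A sketch of resolution-aware partitioning under the stated $1008\times 1008$ cap, assuming (hypothetically) 336px tiles in up to a 3x3 grid, so the visual token count tracks image size and aspect ratio:

```python
from math import ceil

TILE = 336       # assumed per-tile resolution (not confirmed by the paper)
MAX_TILES = 3    # 3 * 336 = 1008 per side, matching the stated cap

def grid_for(width: int, height: int) -> tuple[int, int]:
    """Pick a tile grid matching the image's size and aspect ratio;
    visual tokens scale with cols * rows."""
    cols = min(MAX_TILES, max(1, ceil(width / TILE)))
    rows = min(MAX_TILES, max(1, ceil(height / TILE)))
    return cols, rows

print(grid_for(1008, 336))   # (3, 1): a wide image gets fewer rows
```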
arXiv Detail & Related papers (2024-08-30T03:16:49Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
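A sketch of the DPTR idea with a generic `nn.TransformerDecoder` in place of the paper's recognition decoder: CLIP text features serve as pseudo visual memory; the projection layer is an assumed detail:

```python
import torch
from torch import nn
from transformers import CLIPTextModel, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
proj = nn.Linear(512, 512)   # assumed adapter into the "visual" slot

def pseudo_visual(texts):
    """CLIP text embeddings standing in for visual features."""
    batch = tok(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = text_enc(**batch).last_hidden_state  # (B, L, 512)
    return proj(feats)

memory = pseudo_visual(["hello world"])
tgt = torch.zeros(1, 8, 512)                 # dummy recognition queries
out = decoder(tgt=tgt, memory=memory)        # pre-train against pseudo visuals
```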
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text. We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
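A minimal sketch of Equal-Info Windows, substituting zlib for the paper's neural compressor: each window grows until its compressed size reaches a fixed bit budget, then the text is cut:

```python
import zlib

def equal_info_windows(text: str, budget_bits: int = 512):
    """Greedily segment text into windows of (roughly) equal compressed size."""
    windows, start = [], 0
    for end in range(1, len(text) + 1):
        compressed = zlib.compress(text[start:end].encode())
        if len(compressed) * 8 >= budget_bits:   # budget reached: cut here
            windows.append(text[start:end])
            start = end
    if start < len(text):
        windows.append(text[start:])             # trailing, under-budget window
    return windows
```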
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - Text Rendering Strategies for Pixel Language Models [21.36370101063954]
In this paper, we investigate four approaches to rendering text in the PIXEL model.
We find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks.
Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias.
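A toy sketch of bigram rendering with Pillow, drawing two characters per fixed-width cell so image patches align with character bigrams; PIXEL's real renderer is more involved:

```python
from PIL import Image, ImageDraw, ImageFont

def render_bigrams(text: str, cell: int = 16) -> Image.Image:
    """Draw two characters per cell so each patch covers one bigram."""
    bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    img = Image.new("RGB", (cell * max(1, len(bigrams)), cell), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for i, bg in enumerate(bigrams):
        draw.text((i * cell + 1, 2), bg, fill="black", font=font)
    return img
```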
arXiv Detail & Related papers (2023-11-01T13:49:31Z) - PuMer: Pruning and Merging Tokens for Efficient Vision Language Models [41.81484883647005]
PuMer is a framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of the input image and text.
PuMer inference increases throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
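A hedged sketch of PuMer's two levers in NumPy: text-informed pruning drops image tokens with low relevance to the text embedding, then near-duplicate survivors are merged; thresholds and similarity measures are illustrative placeholders:

```python
import numpy as np

def prune_and_merge(img_tokens, txt_tokens, keep=0.5, merge_thresh=0.9):
    """img_tokens: (N, d); txt_tokens: (M, d). Returns a reduced token set."""
    rel = img_tokens @ txt_tokens.mean(axis=0)        # relevance to the text
    kept = img_tokens[np.argsort(-rel)[: int(len(img_tokens) * keep)]]
    normed = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    merged, used = [], np.zeros(len(kept), dtype=bool)
    for i in range(len(kept)):
        if used[i]:
            continue
        dup = (normed @ normed[i]) > merge_thresh     # near-duplicates of i
        used |= dup
        merged.append(kept[dup].mean(axis=0))         # merge them into one
    return np.stack(merged)
```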
arXiv Detail & Related papers (2023-05-27T17:16:27Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both input texts and images.
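A speculative sketch of a unified conditioning sequence: text and subject-image features are projected into one latent width and concatenated for a diffusion denoiser; all dimensions are illustrative, not UMM-Diffusion's:

```python
import torch
from torch import nn

txt_proj = nn.Linear(512, 768)    # text encoder dim -> unified latent dim
img_proj = nn.Linear(1024, 768)   # image encoder dim -> unified latent dim

def unified_condition(txt_emb: torch.Tensor, img_emb: torch.Tensor):
    """txt_emb: (B, Lt, 512); img_emb: (B, Li, 1024).
    Returns one (B, Lt + Li, 768) conditioning sequence for the denoiser."""
    return torch.cat([txt_proj(txt_emb), img_proj(img_emb)], dim=1)
```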
arXiv Detail & Related papers (2023-03-16T13:50:20Z)