Related papers: TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

URL: http://arxiv.org/abs/2512.22748v2
Date: Wed, 31 Dec 2025 07:27:04 GMT
Title: TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
Authors: Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin,
Abstract summary: Growing number of visual tokens greatly increases inference cost.<n>Visual token pruning has emerged as a promising solution.<n>Our approach can reduce up to 80% of visual tokens while maintaining performance in long context settings.
Score: 6.465999214817427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into Original Model sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach can reduce up to 80% of visual tokens while maintaining performance in long context settings.

Related papers

Towards Generalized Multi-Image Editing for Unified Multimodal Models [56.620038824933566]
Unified Multimodal Models (UMMs) integrate multimodal understanding and generation.<n>UMMs are limited to maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images.<n>We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
arXiv Detail & Related papers (2026-01-09T06:42:49Z)
ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution [71.69364653858447]
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs.<n>We propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying complexities using different numbers of vision tokens.<n> Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.
arXiv Detail & Related papers (2025-10-14T17:58:10Z)
TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models [4.779482139419908]
We introduce a mutual information-based token pruning strategy that removes visual tokens semantically with textual tokens.<n>Our method maintains strong performance while reducing textual tokens by 88.9% on models such as LLaVA-15-7B and LLaVA--7B.
arXiv Detail & Related papers (2025-08-30T02:43:50Z)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations [33.11867433769496]
This paper presents a framework that attempts to unify visual understanding and generation within a shared semantic representation.<n>At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary.<n> Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.
arXiv Detail & Related papers (2025-06-23T17:59:14Z)
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding.<n>We propose a single transformer model which operates on an extended vocabulary of text and image tokens.<n>We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z)
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization [30.73986620551153]
Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens.<n>Previous approaches have attempted to reduce the number of image tokens through token pruning.<n>We propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens.
arXiv Detail & Related papers (2025-05-28T07:00:50Z)
ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models [12.265270657795275]
ImageChain is a framework that enhances MLLMs with sequential reasoning capabilities over image data.<n>Our approach improves performance on the next-scene description task.<n>ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics.
arXiv Detail & Related papers (2025-02-26T18:55:06Z)
ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.<n>There exists a trade-off between reconstruction and generation quality regarding token length.<n>We propose Image, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z)
Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens. We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.<n>We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.<n>Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.