TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2503.10501v1
- Date: Thu, 13 Mar 2025 16:04:31 GMT
- Title: TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
- Authors: Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen,
- Abstract summary: TokenCarve is a training-free, plug-and-play, two-stage token compression framework. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy.
- Score: 8.636574530055817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLMs closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at https://github.com/ShawnTan86/TokenCarve.
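To make the two-stage pipeline concrete, below is a minimal Python sketch of an information-guided prune-then-merge step. It is an illustrative approximation, not the released TokenCarve implementation: the information score here is a simple attention-mass proxy (the paper derives its score from the attention output matrix), and the merge rule is plain cosine-similarity averaging.

```python
import torch
import torch.nn.functional as F

def carve_tokens(visual_tokens: torch.Tensor,
                 attn_weights: torch.Tensor,
                 keep_ratio: float = 0.222) -> torch.Tensor:
    """Illustrative two-stage compression: prune low-information tokens,
    then merge what was pruned into the most similar surviving token.

    visual_tokens: (N, D) visual token embeddings.
    attn_weights:  (H, N) attention mass each visual token receives.
    """
    n = visual_tokens.size(0)
    n_keep = max(1, int(n * keep_ratio))

    # Proxy information score: total attention a token receives across heads.
    info = attn_weights.sum(dim=0)                       # (N,)

    # Stage 1 -- selection: keep the highest-information tokens.
    keep_idx = info.topk(n_keep).indices
    mask = torch.zeros(n, dtype=torch.bool)
    mask[keep_idx] = True
    kept, dropped = visual_tokens[mask], visual_tokens[~mask]

    # Stage 2 -- merging: fold each pruned token into its nearest kept token
    # (cosine similarity), so its information is blended rather than lost.
    if dropped.size(0) > 0:
        sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
        target = sim.argmax(dim=-1)                      # (N - n_keep,)
        merged, counts = kept.clone(), torch.ones(kept.size(0), 1)
        for src, dst in enumerate(target.tolist()):
            merged[dst] += dropped[src]
            counts[dst] += 1
        kept = merged / counts

    return kept                                          # (n_keep, D)
```

Setting keep_ratio=0.222 mirrors the 22.2% retention rate reported in the abstract; the real method applies the same IPGS criterion in both stages rather than the attention-mass heuristic used above.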
Related papers
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models [50.214593234229255]
We introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens.
We propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence.
Experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%.
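The snippet below is only a generic illustration of this style of extreme video-token reduction, under the assumption that a clustering step summarizes many frame-patch tokens with a handful of representatives; Token Dynamics itself uses a more elaborate scheme that preserves spatial-temporal coherence with a dynamic token count.

```python
import torch

def cluster_video_tokens(tokens: torch.Tensor, n_keep: int, iters: int = 10):
    """tokens: (T*P, D) flattened frame-patch embeddings.
    Returns n_keep centroid tokens plus a per-token assignment map."""
    # Initialise centroids from evenly spaced tokens for temporal coverage.
    init = torch.linspace(0, tokens.size(0) - 1, n_keep).long()
    centroids = tokens[init].clone()

    for _ in range(iters):                        # plain k-means refinement
        assign = torch.cdist(tokens, centroids).argmin(dim=1)   # (T*P,)
        for k in range(n_keep):
            members = tokens[assign == k]
            if members.size(0) > 0:
                centroids[k] = members.mean(dim=0)

    return centroids, assign    # compressed tokens + which centroid represents each original token
```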
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs [38.34856927170692]
We propose a training-free framework for analyzing trained Multimodal Large Language Models (MLLMs).
It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens.
Experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs.
arXiv Detail & Related papers (2025-01-31T11:09:16Z) - Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z) - Post-Training Sparse Attention with Double Sparsity [44.772593893621085]
"Double Sparsity" is a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access.
Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens.
With offloading, it achieves a decoding speed acceleration of 16.3x compared to state-of-the-art solutions at a sequence length of 256K.
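A hedged sketch of that combination is shown below; the channel-selection rule here (largest query magnitudes) is an assumption for illustration, whereas the paper identifies important channels through offline calibration.

```python
import torch

def double_sparse_attention(q, k_cache, v_cache, n_channels=16, n_tokens=256):
    """Illustrative token + channel sparsity for one decoding step.

    q:        (D,)   current query vector
    k_cache:  (N, D) cached keys
    v_cache:  (N, D) cached values
    """
    d = q.size(0)
    # Channel sparsity: use only a few "important" channels to cheaply
    # estimate which cached tokens matter (the importance rule is an assumption).
    ch = q.abs().topk(min(n_channels, d)).indices
    approx = k_cache[:, ch] @ q[ch]                      # (N,) rough scores

    # Token sparsity: run exact attention only over the top-scoring tokens.
    top = approx.topk(min(n_tokens, k_cache.size(0))).indices
    weights = torch.softmax((k_cache[top] @ q) / d ** 0.5, dim=-1)
    return weights @ v_cache[top]                        # (D,) attention output
```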
arXiv Detail & Related papers (2024-08-11T18:40:36Z) - Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
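As a rough sketch of the patch-level phase (the patch size of 4 and the mean-pooling aggregation are assumptions, not the paper's exact settings), consecutive token embeddings are folded into patch embeddings that the model is then trained to predict:

```python
import torch

def tokens_to_patches(token_embeds: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Fold a (seq_len, D) sequence of token embeddings into (seq_len // K, D)
    patch embeddings by mean-pooling every K consecutive tokens."""
    seq_len, dim = token_embeds.shape
    usable = (seq_len // patch_size) * patch_size        # drop any ragged tail
    patches = token_embeds[:usable].reshape(-1, patch_size, dim).mean(dim=1)
    return patches  # the LM is then trained to predict the next patch
```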
arXiv Detail & Related papers (2024-07-17T15:48:39Z) - Efficient Large Multi-modal Models via Visual Context Compression [23.966237939194514]
We study the redundancy of visual tokens and efficient training within large language models.
Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simple average pooling leads to only a 3% reduction in visual question answering accuracy.
We introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance on the GQA benchmark.
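The average-pooling baseline from the initial experiment can be sketched as follows (assuming visual tokens arrive as an (N, D) sequence; the exact pooling configuration in the paper may differ):

```python
import torch.nn.functional as F

def pool_visual_tokens(visual_tokens, keep_ratio=0.3):
    """Shrink an (N, D) visual-token sequence by 1-D average pooling;
    keep_ratio=0.3 keeps roughly 30% of the tokens (i.e. drops ~70%)."""
    stride = max(1, round(1 / keep_ratio))
    # avg_pool1d expects (batch, channels, length), so pool along the token axis.
    pooled = F.avg_pool1d(visual_tokens.T.unsqueeze(0),
                          kernel_size=stride, stride=stride)
    return pooled.squeeze(0).T                           # (~N * keep_ratio, D)
```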
arXiv Detail & Related papers (2024-06-28T17:57:14Z) - Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient masked video autoencoder (MVA) approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
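One simple way to picture "tokens containing rich motion features" is to score each patch by how much its embedding changes between consecutive frames and keep the top movers; this frame-difference criterion is an assumption for illustration, not EVEREST's exact selection rule.

```python
import torch

def select_motion_tokens(frame_tokens: torch.Tensor, keep_ratio: float = 0.25):
    """frame_tokens: (T, P, D) patch embeddings for T frames.
    Keeps, per frame transition, the patches whose embeddings changed most."""
    diffs = (frame_tokens[1:] - frame_tokens[:-1]).norm(dim=-1)   # (T-1, P)
    n_keep = max(1, int(frame_tokens.size(1) * keep_ratio))
    keep_idx = diffs.topk(n_keep, dim=-1).indices                 # (T-1, n_keep)
    # Gather the selected (motion-rich) tokens from frames 1..T-1.
    sel = torch.gather(frame_tokens[1:], 1,
                       keep_idx.unsqueeze(-1).expand(-1, -1, frame_tokens.size(-1)))
    return sel  # (T-1, n_keep, D): the tokens kept for masked-video pre-training
```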
arXiv Detail & Related papers (2022-11-19T09:57:01Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only a 0.2% drop in top-1 accuracy.
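A generic version of attention-guided pruning is sketched below (keeping the patches that receive the most [CLS] attention); the paper's contribution is a learnable, adaptive version of this idea rather than the fixed keep ratio used here.

```python
import torch

def prune_by_cls_attention(tokens, attn, keep_ratio=0.7):
    """tokens: (N+1, D) with the [CLS] token at index 0.
    attn:   (H, N+1, N+1) self-attention weights of the current layer."""
    cls_to_patch = attn[:, 0, 1:].mean(dim=0)           # (N,) averaged over heads
    n_keep = max(1, int(cls_to_patch.size(0) * keep_ratio))
    keep = cls_to_patch.topk(n_keep).indices + 1        # shift past [CLS]
    keep = torch.cat([torch.tensor([0]), keep.sort().values])
    return tokens[keep]                                  # [CLS] + the kept patches
```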
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.