Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
- URL: http://arxiv.org/abs/2502.11501v1
- Date: Mon, 17 Feb 2025 07:05:36 GMT
- Title: Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
- Authors: Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang,
- Abstract summary: Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs.
Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs.
In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
- Score: 19.35502303812707
- License:
- Abstract: Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Are attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The ignorance of previous research on these problems hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
Related papers
- Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More [18.928285521147057]
We show that importance is not an ideal indicator to decide whether a token should be pruned.
We propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on its duplication with other tokens.
Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance.
arXiv Detail & Related papers (2025-02-17T06:56:28Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chainof-thought (CoT) data.
We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs)
We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.
PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Training Large Language Models to Reason in a Continuous Latent Space [84.5618790930725]
We introduce a new paradigm Coconut (Chain of Continuous Thought) to explore the potential of large language models (LLMs) reasoning in an unrestricted latent space.
Experiments show that Coconut can effectively augment the LLM on several reasoning tasks.
These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
arXiv Detail & Related papers (2024-12-09T18:55:56Z) - Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation.
We propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them.
arXiv Detail & Related papers (2024-12-03T08:29:27Z) - Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes.
We present a novel framework for identifying these tokens through rollout sampling.
We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models [35.29961848648335]
Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains.
They are susceptible to reasoning errors that impact the quality of subsequent reasoning chains and the final answer due to their autoregressive token-by-token generating nature.
Recent works have proposed adopting external verifiers to guide the generation of reasoning paths, but existing works utilize models that have been trained with step-by-step labels.
arXiv Detail & Related papers (2024-07-12T13:16:50Z) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z) - LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z) - Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.