Related papers: Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?

URL: http://arxiv.org/abs/2502.11501v1
Date: Mon, 17 Feb 2025 07:05:36 GMT
Title: Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Authors: Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang
Abstract summary: Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
Score: 19.35502303812707
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The neglect of these questions in previous research hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
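
Illustration (not from the paper): a minimal Python sketch of the two selection strategies the abstract contrasts, naive random token selection and score-based selection. The function names, the scores, and the keep budget are hypothetical and chosen only for illustration; the paper's own methods and evaluation are not reproduced here.

```python
import random


def prune_random(num_tokens, keep, seed=0):
    """Keep `keep` token indices chosen uniformly at random (naive baseline)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_tokens), keep))


def prune_by_score(scores, keep):
    """Keep the `keep` token indices with the highest importance score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical per-token importance scores (for example, attention received).
# These values are made up for illustration and are not taken from the paper.
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.08, 0.03, 0.02]

kept_scored = prune_by_score(scores, keep=3)
kept_random = prune_random(len(scores), keep=3)
```

Both functions return the sorted indices of the retained tokens, so the surviving tokens keep their original order. Only those tokens would be passed to later layers, which is where the savings in computation and KV storage come from.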

Related papers

Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? [72.70486097967124]
We formalize a framework using deterministic finite automata (DFAs). We show that there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
arXiv Detail & Related papers (2025-04-02T17:45:58Z)
Language Model Uncertainty Quantification with Attention Chain [9.093726246465117]
A large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. We propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs.
arXiv Detail & Related papers (2025-03-24T21:43:47Z)
TokenButler: Token Importance is Predictable [8.514853311344458]
Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. Prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. We introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens.
arXiv Detail & Related papers (2025-03-10T16:41:14Z)
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [60.04718679054704]
We introduce Sketch-of-Thought (SoT), a novel prompting framework. It combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage. SoT achieves token reductions of 76% with negligible accuracy impact.
arXiv Detail & Related papers (2025-03-07T06:57:17Z)
Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More [18.928285521147057]
We show that importance is not an ideal indicator to decide whether a token should be pruned. We propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens. Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance.
arXiv Detail & Related papers (2025-02-17T06:56:28Z)
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z)
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. We propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them.
arXiv Detail & Related papers (2024-12-03T08:29:27Z)
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step. We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models [35.29961848648335]
Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains. They are susceptible to reasoning errors that impact the quality of subsequent reasoning chains and the final answer due to their autoregressive token-by-token generating nature. Recent works have proposed adopting external verifiers to guide the generation of reasoning paths, but existing works utilize models that have been trained with step-by-step labels.
arXiv Detail & Related papers (2024-07-12T13:16:50Z)
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks. We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks. We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z)
Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs. Computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
Guiding Language Model Reasoning with Planning Tokens [122.43639723387516]
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks. We propose a hierarchical generation scheme to encourage a more structural generation of chain-of-thought steps. Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.