Related papers: Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

URL: http://arxiv.org/abs/2504.18579v3
Date: Mon, 29 Sep 2025 03:41:48 GMT
Title: Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
Authors: Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu
Abstract summary: We explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named Sparsity Forcing. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards.
Score: 40.93786579652003
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model's inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push budget lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named Sparsity Forcing. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20% to 75% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3× while speeding up decoding by up to 3.3×.
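The group-contrast idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything here is an assumption for illustration: the function names `joint_reward` and `group_advantages`, the weight `lam`, the choice to count efficiency only when the answer is correct, and the mean/standard-deviation normalization within a group (a GRPO-style convention, not confirmed as the authors' choice).

```python
from statistics import mean, pstdev


def joint_reward(correct: bool, reduction_ratio: float, lam: float = 1.0) -> float:
    """Hypothetical joint reward (assumed form, not taken from the paper).

    A correct answer earns 1 plus a bonus proportional to the fraction of
    tokens dropped; an incorrect answer earns 0 regardless of efficiency.
    """
    if not 0.0 <= reduction_ratio <= 1.0:
        raise ValueError("reduction_ratio must lie in [0, 1]")
    return (1.0 + lam * reduction_ratio) if correct else 0.0


def group_advantages(rewards, eps=1e-8):
    """Contrast rollouts within one group by centering and scaling rewards.

    Assumed normalization: subtract the group mean, divide by the group
    (population) standard deviation plus a small eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of rollouts for the same prompt, each run under a different
# token budget: (answer_correct, token_reduction_ratio). Values are made up.
rollouts = [
    (True, 0.25),
    (True, 0.50),
    (True, 0.75),
    (False, 0.90),  # most aggressive budget, but the answer is wrong
]

rewards = [joint_reward(c, r) for c, r in rollouts]
advantages = group_advantages(rewards)

for (c, r), a in zip(rollouts, advantages):
    print(f"correct={c!s:5} reduction={r:.2f} advantage={a:+.3f}")
```

Under these assumptions, the correct rollout with the highest reduction ratio receives the largest advantage, and the incorrect rollout receives the lowest one even though it dropped the most tokens.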

Related papers

Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank [65.00301565190824]
Repulsor is a plug-and-play training framework that requires no external encoders. Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.
arXiv Detail & Related papers (2025-12-09T14:39:26Z)
TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z)
e1: Learning Adaptive Control of Reasoning Effort [88.51897900019485]
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. We propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens.
arXiv Detail & Related papers (2025-10-30T23:12:21Z)
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z)
Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference [7.690958366125321]
This paper introduces informed routing, a new paradigm that proactively addresses these issues. We propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. Experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs.
arXiv Detail & Related papers (2025-10-10T09:59:36Z)
Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models. CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning [42.82825782517565]
Post-training DeepScaleR-1.5B with ALP cuts average token usage by 50% without significantly dropping performance. Relative to fixed-budget and uniform penalty baselines, ALP redistributes its reduced budget more intelligently by cutting compute on easy prompts and reallocating saved tokens to difficult ones.
arXiv Detail & Related papers (2025-06-05T17:17:05Z)
ACE: Exploring Activation Cosine Similarity and Variance for Accurate and Calibration-Efficient LLM Pruning [15.933542902352604]
We propose an efficient and effective pruning method that simultaneously achieves high pruning performance and fast pruning speed. Experimental results show that our method achieves up to an 18% reduction in perplexity and up to 63% decrease in pruning time on prevalent LLMs.
arXiv Detail & Related papers (2025-05-28T05:25:16Z)
The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training [63.99981166397423]
Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET, a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty.
arXiv Detail & Related papers (2025-05-25T16:24:12Z)
COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection [3.647905567437244]
Sparse activation methods selectively deactivate non-essential parameters during inference. We propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.
arXiv Detail & Related papers (2025-05-23T10:10:22Z)
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z)
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
Leveraging the true depth of LLMs [46.81174316936993]
Large Language Models (LLMs) demonstrate remarkable capabilities at the cost of high compute requirements. Recent studies have demonstrated that intermediate layers in LLMs can be removed or reordered without substantial accuracy loss. We propose a novel method that groups consecutive layers into pairs evaluated in parallel.
arXiv Detail & Related papers (2025-02-05T00:26:27Z)
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment. We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z)
CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [7.8430836312711465]
This paper reformulates the activation sparsification problem to explicitly capture the relationship between activation sparsity and model performance. We propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over eight downstream tasks while activating fewer parameters than existing methods.
arXiv Detail & Related papers (2024-09-02T16:41:44Z)
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights [2.8461446020965435]
We introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing Latent Diffusion Models. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG).
arXiv Detail & Related papers (2024-04-18T06:35:37Z)
Two Counterexamples to Tokenization and the Noiseless Channel [24.127593302335164]
In Tokenization and the Noiseless Channel, Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase Rényi efficiency while decreasing the downstream model performance.
arXiv Detail & Related papers (2024-02-22T15:03:25Z)
LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged to fine-tune large language models (LLMs). LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner. LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
PoseRAC: Pose Saliency Transformer for Repetitive Action Counting [56.34379680390869]
We introduce Pose Saliency Representation, which efficiently represents each action using only two salient poses instead of redundant frames. We also introduce PoseRAC, which is based on this representation and achieves state-of-the-art performance. Our lightweight model is highly efficient, requiring only 20 minutes for training on a GPU, and infers nearly 10x faster compared to previous methods.
arXiv Detail & Related papers (2023-03-15T08:51:17Z)
Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study [25.58608455210458]
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder.
arXiv Detail & Related papers (2023-03-12T19:52:34Z)
FasterPose: A Faster Simple Baseline for Human Pose Estimation [65.8413964785972]
We propose a design paradigm for cost-effective network with LR representation for efficient pose estimation, named FasterPose. We study the training behavior of FasterPose, and formulate a novel regressive cross-entropy (RCE) loss function for accelerating the convergence. Compared with the previously dominant network of pose estimation, our method reduces 58% of the FLOPs and simultaneously gains 1.3% improvement of accuracy.
arXiv Detail & Related papers (2021-07-07T13:39:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.