Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
- URL: http://arxiv.org/abs/2602.01649v1
- Date: Mon, 02 Feb 2026 05:09:48 GMT
- Title: Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
- Authors: Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng
- Abstract summary: CaCoVID is a novel Contribution-aware token Compression algorithm for VIDeo understanding. First, it introduces a reinforcement learning-based framework that optimizes a policy network to select the video token combinations with the greatest contribution to correct predictions. Second, it proposes a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations.
- Score: 32.030660835097926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms have been proposed that prioritize retaining the features with the highest attention scores, so as to minimize perturbations in attention computations. However, the correlation between attention scores and the tokens' actual contribution to correct answers remains ambiguous. To address this limitation, we propose a novel \textbf{C}ontribution-\textbf{a}ware token \textbf{Co}mpression algorithm for \textbf{VID}eo understanding (\textbf{CaCoVID}) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select the video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Second, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Code will be released.
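The abstract gives no implementation details, but the loop it describes — a policy network scoring video tokens, sampling a compressed combination, and being reinforced when the video LLM still answers correctly — can be illustrated with a minimal REINFORCE-style sketch. All module names, shapes, the subset log-probability approximation, and the binary reward below are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class TokenSelectionPolicy(nn.Module):
    """Scores each video token; a higher score means more likely to be kept."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, dim) -> per-token keep probability, shape (N,)
        return torch.sigmoid(self.scorer(tokens)).squeeze(-1)

def reinforce_step(policy, optimizer, tokens, keep_k, is_answer_correct):
    """One REINFORCE update on a sampled token combination."""
    probs = policy(tokens)                                   # (N,)
    # Sample a combination of keep_k tokens (approximate subset log-prob).
    keep_idx = torch.multinomial(probs, keep_k, replacement=False)
    log_prob = torch.log(probs[keep_idx] + 1e-8).sum()
    # Hypothetical reward: 1 if the video LLM still answers correctly
    # from the compressed token subset, else 0.
    reward = 1.0 if is_answer_correct(tokens[keep_idx]) else 0.0
    loss = -reward * log_prob                                # policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

# Toy usage: random "video tokens" and a stand-in correctness oracle.
tokens = torch.randn(256, 64)                 # 256 tokens, 64-dim features
policy = TokenSelectionPolicy(dim=64)
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
reinforce_step(policy, opt, tokens, keep_k=32,
               is_answer_correct=lambda kept: kept.mean().item() > 0)
```

In a real system the oracle would be a frozen video LLM scored against ground-truth answers, and the paper's online combination space sampling would replace the naive per-token sampling above.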
Related papers
- InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression [114.03378443007074]
Current tokenizers rigidly compress all content at a fixed rate, leading to redundancy or information loss. This paper introduces InfoTok, a principled framework for adaptive video tokenization. We develop a transformer-based compressor that adapts tokenization to the video content.
arXiv Detail & Related papers (2025-12-18T17:13:59Z)
- FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding [55.700832127331324]
FLoC is an efficient visual token compression framework based on the facility location function. Our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens. Our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution; a minimal sketch of the greedy selection step appears below.
arXiv Detail & Related papers (2025-10-31T17:29:39Z)
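The facility location function is a standard submodular coverage objective, f(S) = Σ_i max_{j∈S} sim(i, j), and greedy maximization is a natural way to realize the training-free selection the summary describes. The sketch below illustrates that greedy step under assumed choices (cosine similarity, zero coverage baseline); it is not the paper's released code.

```python
import torch

def facility_location_select(tokens: torch.Tensor, k: int) -> list:
    """Greedily maximize f(S) = sum_i max(0, max_{j in S} sim(i, j))."""
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    sim = feats @ feats.T                       # (N, N) cosine similarity
    n = sim.shape[0]
    selected = []
    coverage = torch.zeros(n)                   # best similarity to S so far
    for _ in range(k):
        # Marginal gain of each candidate j: total coverage improvement.
        gain = torch.clamp(sim - coverage.unsqueeze(1), min=0).sum(dim=0)
        if selected:                            # never re-pick a token
            gain[torch.tensor(selected)] = -1.0
        j = int(gain.argmax())
        selected.append(j)
        coverage = torch.maximum(coverage, sim[:, j])
    return selected

kept = facility_location_select(torch.randn(256, 64), k=32)
```

Greedy selection enjoys the usual (1 - 1/e) approximation guarantee for monotone submodular objectives, which is what makes this style of selection both fast and principled.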
- LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs [23.801172170798132]
LLaVA-Scissor is a training-free token compression strategy designed for multimodal large language models. We propose to leverage the Semantic Connected Components (SCC) approach to ensure comprehensive semantic coverage. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks; a minimal sketch of the connected-components grouping appears below.
arXiv Detail & Related papers (2025-06-27T02:29:58Z)
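The summary names Semantic Connected Components (SCC) without details; a common reading is to threshold a token-similarity graph and collapse each connected component to a single representative. The sketch below illustrates that reading — the threshold `tau`, the cosine metric, and mean pooling per component are all assumptions.

```python
import torch

def scc_compress(tokens: torch.Tensor, tau: float = 0.8) -> torch.Tensor:
    """Group tokens into connected components of the thresholded cosine-
    similarity graph, then keep one mean token per component."""
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    adj = (feats @ feats.T) >= tau              # boolean adjacency (N, N)
    n = tokens.shape[0]
    labels = [-1] * n
    comp = 0
    for s in range(n):                          # flood-fill each component
        if labels[s] != -1:
            continue
        stack = [s]
        labels[s] = comp
        while stack:
            u = stack.pop()
            for v in torch.nonzero(adj[u]).flatten().tolist():
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    labels_t = torch.tensor(labels)
    # One compressed token per semantic component: the member mean.
    return torch.stack([tokens[labels_t == c].mean(dim=0)
                        for c in range(comp)])

compressed = scc_compress(torch.randn(128, 64), tau=0.85)
```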
- Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space [94.07013629356113]
AdapTok is an adaptive, temporally causal video tokenizer that can flexibly allocate tokens to different frames based on video content. AdapTok consistently improves reconstruction quality and generation performance under different token budgets.
arXiv Detail & Related papers (2025-05-22T17:59:02Z)
- VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [35.38573641029626]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On this task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark; the generic vector-quantization assignment step is sketched below.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
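VQToken's exact tokenizer architecture is not described in the summary; the generic building block it relies on — assigning each continuous token to its nearest entry in a discrete codebook — looks roughly as follows. The codebook size and L2 distance here are illustrative assumptions.

```python
import torch

def vector_quantize(tokens: torch.Tensor, codebook: torch.Tensor):
    """Map each token to its nearest codebook entry (L2 distance)."""
    # tokens: (N, d), codebook: (K, d)
    dists = torch.cdist(tokens, codebook)     # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                 # discrete code per token
    return idx, codebook[idx]                 # codes and quantized tokens

codebook = torch.randn(64, 32)                # K = 64 discrete codes
idx, quantized = vector_quantize(torch.randn(512, 32), codebook)
unique_codes = idx.unique()                   # compact discrete video repr.
```

Extreme reduction then comes from representing the whole video by the (few) distinct codes used rather than by the original token sequence.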
- Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding [11.211803499867639]
We propose DYTO, a novel dynamic token merging framework for zero-shot video understanding. DYTO integrates hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences. Experiments demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods; a minimal bipartite-merging sketch appears below.
arXiv Detail & Related papers (2024-11-21T18:30:11Z)
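Bipartite token merging is commonly implemented ToMe-style: split the tokens into two sets, match each token in one set to its most similar partner in the other, and average the r most redundant pairs. The alternating split rule, cosine similarity, and plain averaging below are assumptions for illustration, not DYTO's exact procedure.

```python
import torch

def bipartite_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """ToMe-style bipartite merging: alternate tokens form sets A and B;
    the r most similar A->B pairs are averaged into B."""
    a, b = tokens[0::2], tokens[1::2]         # (Na, d), (Nb, d)
    fa = torch.nn.functional.normalize(a, dim=-1)
    fb = torch.nn.functional.normalize(b, dim=-1)
    sim = fa @ fb.T                           # (Na, Nb)
    best_sim, best_b = sim.max(dim=1)         # best partner in B per A token
    merge_a = best_sim.topk(r).indices        # r most redundant A tokens
    keep_mask = torch.ones(a.shape[0], dtype=torch.bool)
    keep_mask[merge_a] = False
    b = b.clone()
    for i in merge_a.tolist():                # fold merged A tokens into B
        j = int(best_b[i])
        b[j] = (b[j] + a[i]) / 2
    return torch.cat([a[keep_mask], b], dim=0)

merged = bipartite_merge(torch.randn(200, 64), r=50)   # 200 -> 150 tokens
```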
- Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Model (LLM) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets show that our EMCL can learn more discriminative video-and-language representations; the E/M alternation is sketched below.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
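The EM step EMCL describes — finding a compact set of bases for the latent space — can be sketched as alternating soft assignments (E-step) and responsibility-weighted basis updates (M-step), after which features are re-expressed over the bases. The temperature, iteration count, and initialization below are assumptions, not the paper's implementation.

```python
import torch

def em_bases(feats: torch.Tensor, k: int = 8, iters: int = 3,
             tau: float = 0.05) -> torch.Tensor:
    """Find k bases that compactly span the features, then project
    the features onto them (a low-rank re-representation)."""
    bases = feats[torch.randperm(feats.shape[0])[:k]]      # init from data
    for _ in range(iters):
        # E-step: soft responsibility of each feature to each basis.
        resp = torch.softmax(feats @ bases.T / tau, dim=1)  # (N, k)
        # M-step: each basis becomes the responsibility-weighted mean.
        bases = (resp.T @ feats) / (resp.sum(dim=0).unsqueeze(1) + 1e-8)
    # Compact representation: features expressed in the learned basis set.
    return torch.softmax(feats @ bases.T / tau, dim=1) @ bases

compact = em_bases(torch.randn(100, 64))
```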
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understanding of texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)