Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
- URL: http://arxiv.org/abs/2603.01400v1
- Date: Mon, 02 Mar 2026 03:06:40 GMT
- Title: Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
- Authors: Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
- Abstract summary: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs.
- Score: 61.11154533305096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM at the cost of shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility; they also tend to discard subtle yet informative context carried by merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts to comprehensively aggregate informative context via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative context of pruned tokens into these intra-frame anchors. Building on temporal frame clips, the first frame of each clip serves as a keyframe anchor that assembles similar information from consecutive frames through optimal transport, while distinct tokens are kept to represent temporal dynamics, yielding efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT
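As a rough, illustrative reconstruction of the mechanism the abstract describes (attention-guided anchors plus optimal-transport aggregation), the sketch below picks the most salient tokens of a frame as anchors and folds the remaining tokens into them with a Sinkhorn transport plan. Everything here, from the function names to the uniform marginals and the saliency source, is an assumption made from the abstract alone, not the paper's implementation.

```python
import torch

def sinkhorn(cost, eps=0.05, iters=50):
    # entropy-regularized OT plan between uniform marginals
    cost = cost / cost.max()                      # normalize for stability
    n, m = cost.shape
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    v = torch.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]            # transport plan (n, m)

def aggregate_into_anchors(tokens, saliency, n_anchors=16):
    # anchors = most attention-salient tokens; the rest are "pruned"
    idx = saliency.topk(n_anchors).indices
    anchors = tokens[idx]                          # (A, D)
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[idx] = False
    pruned = tokens[mask]                          # (P, D)
    # move the pruned tokens' context onto the anchors via OT
    plan = sinkhorn(torch.cdist(pruned, anchors))  # (P, A)
    weights = plan / plan.sum(dim=0, keepdim=True).clamp_min(1e-9)
    return anchors + weights.T @ pruned            # context-enriched anchors

tokens = torch.randn(196, 768)    # one frame of ViT patch tokens
saliency = torch.rand(196)        # e.g. CLS-attention scores
compact = aggregate_into_anchors(tokens, saliency)   # (16, 768)
```

Because the transport plan's column marginals are uniform, each anchor absorbs a convex combination of the pruned tokens, so merged context is redistributed rather than discarded.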
Related papers
- Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation.
SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object.
We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
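The summary's mechanism can be hedged into a minimal sketch: score each visual token by cosine similarity to the text-prompt embedding and keep only the top fraction before temporal propagation. The function name, keep ratio, and tensor shapes are illustrative assumptions, not SAM2's actual API.

```python
import torch
import torch.nn.functional as F

def text_guided_prune(visual_tokens, text_emb, keep_ratio=0.25):
    # keep the tokens most relevant to the text query before
    # they enter the expensive temporal reasoning stage
    v = F.normalize(visual_tokens, dim=-1)          # (N, D)
    t = F.normalize(text_emb, dim=-1)               # (D,)
    relevance = v @ t                               # cosine similarity, (N,)
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep = relevance.topk(k).indices.sort().values  # preserve spatial order
    return visual_tokens[keep], keep

tokens = torch.randn(4096, 256)   # image-encoder tokens for one frame
prompt = torch.randn(256)         # embedding of the text prompt
kept, idx = text_guided_prune(tokens, prompt)       # (1024, 256)
```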
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - KFFocus: Highlighting Keyframes for Enhanced Video Understanding [33.69757683688046]
We propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames.
By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details.
We also introduce a multimodal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame.
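The varying per-frame condensation ratios could be realized along the lines of the sketch below, which splits a global token budget across frames in proportion to a relevance score; the names and the saliency heuristic are assumptions based only on this summary.

```python
import torch

def condense(frames, saliency, total_budget=512):
    # frames: (T, N, D) patch tokens; saliency: (T, N) token importance
    frame_scores = saliency.mean(dim=1)                 # frame relevance
    weights = frame_scores / frame_scores.sum()
    budgets = (weights * total_budget).round().long().clamp(1, frames.size(1))
    kept = []
    for f, s, k in zip(frames, saliency, budgets):
        kept.append(f[s.topk(int(k)).indices])          # keep top tokens
    return kept   # ragged: more relevant frames keep more tokens

frames = torch.randn(8, 196, 768)
saliency = torch.rand(8, 196)
compact = condense(frames, saliency)   # list of (k_t, 768) tensors
```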
arXiv Detail & Related papers (2025-08-12T14:57:03Z) - Temporal Cluster Assignment for Efficient Real-Time Video Segmentation [9.248291541710781]
Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains.
The window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques.
We introduce Temporal Cluster Assignment (TCA), a lightweight, effective, and fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence.
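A hedged sketch of clustering that exploits temporal coherence, in the spirit of TCA: cluster the first frame from scratch, then warm-start every later frame's cluster centers from the previous frame so assignments stay temporally stable. This is an illustrative reconstruction, not the paper's algorithm.

```python
import torch

def kmeans_step(tokens, centers):
    # one assignment + update step of k-means
    assign = torch.cdist(tokens, centers).argmin(dim=1)
    return torch.stack([
        tokens[assign == k].mean(dim=0) if (assign == k).any() else centers[k]
        for k in range(centers.size(0))
    ])

def temporal_cluster_video(frames, n_clusters=49):
    # warm-start each frame's centers from the previous frame
    centers = frames[0][torch.randperm(frames.size(1))[:n_clusters]]
    reduced = []
    for f in frames:                    # (N, D) tokens per frame
        centers = kmeans_step(f, centers)
        reduced.append(centers)         # K centers stand in for N tokens
    return torch.stack(reduced)         # (T, K, D)

video = torch.randn(16, 196, 384)       # e.g. Swin patch tokens per frame
compact = temporal_cluster_video(video) # (16, 49, 384)
```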
arXiv Detail & Related papers (2025-08-07T20:52:49Z) - HoliTom: Holistic Token Merging for Fast Video Large Language Models [32.620504076794795]
Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens.
We introduce HoliTom, a novel training-free holistic token merging framework.
We also introduce a robust inner-LLM merging approach based on token similarity, designed for superior performance and compatibility with outer-LLM pruning.
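Similarity-based token merging of the kind mentioned above is often implemented ToMe-style; the sketch below averages the most similar token pairs from a bipartite split. It is a generic stand-in under assumed shapes (and it ignores partner collisions that a full implementation would resolve), not HoliTom's actual procedure.

```python
import torch
import torch.nn.functional as F

def merge_similar_pairs(tokens, n_merge):
    # bipartite split: even-indexed tokens vs odd-indexed tokens
    x = F.normalize(tokens, dim=-1)
    sim = x[0::2] @ x[1::2].T                  # cosine similarity
    best, partner = sim.max(dim=1)             # best odd partner per even token
    order = best.argsort(descending=True)
    merge_a, keep_a = order[:n_merge], order[n_merge:]
    ta, tb = tokens[0::2], tokens[1::2]
    merged = 0.5 * (ta[merge_a] + tb[partner[merge_a]])
    keep_b = torch.ones(tb.size(0), dtype=torch.bool)
    keep_b[partner[merge_a]] = False           # absorbed partners are dropped
    return torch.cat([ta[keep_a], merged, tb[keep_b]])

tokens = torch.randn(1176, 4096)        # video tokens at some LLM layer
reduced = merge_similar_pairs(tokens, n_merge=300)  # roughly 300 fewer tokens
```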
arXiv Detail & Related papers (2025-05-27T15:28:45Z) - LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs [55.81291976637705]
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations.
Existing methods map textual positions and durations into the visual space from frame-based videos, but suffer from temporal sparsity that limits temporal coordination.
We introduce LLaFEA, which leverages event cameras for temporally dense perception and frame-event fusion.
arXiv Detail & Related papers (2025-03-10T05:30:30Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
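The described wiring might look like the following sketch, with a small transformer attending across time at each spatial position between the image encoder and the LLM projector; dimensions, depth, and module names are assumptions rather than STORM's configuration.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    # illustrative stand-in: self-attention across time for each patch
    def __init__(self, dim=1024, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, frame_tokens):          # (T, N, D) per video
        x = frame_tokens.permute(1, 0, 2)     # (N, T, D): attend over time
        x = self.encoder(x)                   # temporal mixing per patch
        return x.permute(1, 0, 2)             # back to (T, N, D)

video_tokens = torch.randn(32, 196, 1024)     # image-encoder output
enriched = TemporalEncoder()(video_tokens)    # then pooled/projected to LLM
```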
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Leveraging Temporal Contextualization for Video Action Recognition [47.8361303269338]
We propose a framework for video understanding called Temporally Contextualized CLIP (TC-CLIP).
We introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos.
The Video-Prompting (VP) module processes context tokens to generate informative prompts in the text modality.
arXiv Detail & Related papers (2024-04-15T06:24:56Z) - CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
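The multi-segment clustering idea might be sketched as follows: split the token sequence into temporal segments, run k-means within each, and keep only the real token nearest each center (a medoid) while dropping the rest. Segment counts, cluster counts, and names are illustrative, not CenterCLIP's implementation.

```python
import torch

def segment_medoids(tokens, n_segments=4, k=32, iters=5):
    kept = []
    for seg in tokens.chunk(n_segments, dim=0):         # temporal segments
        centers = seg[torch.randperm(seg.size(0))[:k]]  # random init
        for _ in range(iters):                          # plain k-means
            assign = torch.cdist(seg, centers).argmin(dim=1)
            for c in range(k):
                if (assign == c).any():
                    centers[c] = seg[assign == c].mean(dim=0)
        # keep the real token nearest each center (medoid), drop the rest
        medoids = torch.cdist(centers, seg).argmin(dim=1).unique()
        kept.append(seg[medoids])
    return torch.cat(kept)

tokens = torch.randn(12 * 196, 512)        # tokens from 12 frames
representative = segment_medoids(tokens)   # at most 4 * 32 tokens survive
```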
arXiv Detail & Related papers (2022-05-02T12:02:09Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
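A biaffine scorer over all (start, end) index pairs, as the summary describes, can be sketched as follows; the projections, dimensions, and masking details are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BiaffineBoundaryScorer(nn.Module):
    # score every (start, end) pair of clip indices at once
    def __init__(self, dim=256):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)
        self.end_proj = nn.Linear(dim, dim)
        self.W = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, clip_feats):            # (T, D) query-aware clip features
        s = self.start_proj(clip_feats)       # start-role representations
        e = self.end_proj(clip_feats)         # end-role representations
        scores = s @ self.W @ e.T             # (T, T): scores[i, j] = s_i W e_j
        invalid = torch.ones_like(scores).tril(-1).bool()   # start > end
        return scores.masked_fill(invalid, float("-inf"))

feats = torch.randn(64, 256)                  # 64 video clips
score_map = BiaffineBoundaryScorer()(feats)   # argmax -> (start, end) boundary
```

The lower-triangular mask rules out segments whose start index exceeds their end index, so a single argmax over the score map yields a valid boundary.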
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.