Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
- URL: http://arxiv.org/abs/2508.17686v1
- Date: Mon, 25 Aug 2025 05:51:21 GMT
- Title: Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
- Authors: Yogesh Kumar
- Abstract summary: Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance.
- Score: 2.648500779572419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
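Since the listing includes no code, the following is a minimal sketch of the general idea: allocating a global token budget across time in proportion to query relevance, so that temporally relevant segments keep a higher token density. The relevance scores, the norm-based saliency proxy, and all names (`prune_tokens_by_temporal_relevance`, `keep_ratio`) are illustrative assumptions, not LGTTP's actual mechanism.

```python
import numpy as np

def prune_tokens_by_temporal_relevance(frame_tokens, relevance, keep_ratio=0.35):
    """Keep more tokens in temporally relevant segments, fewer elsewhere.

    frame_tokens: list of per-frame token arrays, each of shape (T_i, d)
    relevance:    per-frame scores in [0, 1], e.g. from matching temporal cues
                  in the query ("after", "first", timestamps) against segment
                  boundaries (hypothetical scorer, not the paper's).
    keep_ratio:   global fraction of tokens retained (0.35 ~ a 65% reduction).
    """
    total = sum(len(t) for t in frame_tokens)
    budget = int(total * keep_ratio)
    # Allocate the budget proportionally to relevance, but keep at least one
    # token per frame so contextual continuity is preserved.
    weights = np.asarray(relevance, dtype=float) + 1e-6
    weights /= weights.sum()
    kept = []
    for tokens, w in zip(frame_tokens, weights):
        k = max(1, min(len(tokens), int(round(w * budget))))
        # Stand-in saliency: token L2 norm; a real system would use
        # attention or query-conditioned scores.
        saliency = np.linalg.norm(tokens, axis=1)
        idx = np.argsort(saliency)[-k:]
        kept.append(tokens[np.sort(idx)])
    return kept

# Usage: 8 frames of 16 tokens each; frames 3-5 are temporally relevant.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(16, 64)) for _ in range(8)]
scores = [0.05, 0.05, 0.1, 0.9, 0.9, 0.8, 0.1, 0.05]
pruned = prune_tokens_by_temporal_relevance(frames, scores)
print([len(f) for f in pruned])  # denser in the relevant segment
```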
Related papers
- Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models [61.11154533305096]
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs.
arXiv Detail & Related papers (2026-03-02T03:06:40Z)
- Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset for fine-tuning, encouraging the model's responses to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
- DATE: Dynamic Absolute Time Enhancement for Long Video Understanding [8.720269393713451]
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE), which enhances temporal awareness in MLLMs, and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
arXiv Detail & Related papers (2025-09-11T08:49:22Z)
- Temporal Cluster Assignment for Efficient Real-Time Video Segmentation [9.248291541710781]
Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. The window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. We introduce Temporal Cluster Assignment (TCA), a lightweight, effective, and fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence.
arXiv Detail & Related papers (2025-08-07T20:52:49Z)
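The TCA blurb above describes clustering tokens across frames via temporal coherence. Here is a minimal sketch of one way such an assignment could work, assuming cluster centers carried between frames and updated by an exponential moving average; the function name, the EMA update, and the fixed cluster count are illustrative assumptions, not the paper's method.

```python
import numpy as np

def temporal_cluster_assign(prev_centers, tokens, momentum=0.5):
    """Assign this frame's tokens to the previous frame's cluster centers,
    then update the centers (hypothetical names, not the paper's API)."""
    # Distance of every token to every carried-over center: (T, K)
    d = np.linalg.norm(tokens[:, None, :] - prev_centers[None, :, :], axis=-1)
    assign = d.argmin(axis=1)
    centers = prev_centers.copy()
    for k in range(len(centers)):
        members = tokens[assign == k]
        if len(members):  # EMA keeps centers stable across frames
            centers[k] = momentum * centers[k] + (1 - momentum) * members.mean(0)
    return centers, assign  # one merged token (center) per cluster

# Usage: 4 frames of 64 tokens each, reduced to K=8 merged tokens per frame.
rng = np.random.default_rng(1)
centers = rng.normal(size=(8, 32))  # init, e.g., from k-means on frame 0
for _ in range(4):
    frame = rng.normal(size=(64, 32))
    centers, assign = temporal_cluster_assign(centers, frame)
print(centers.shape)  # (8, 32): a fixed token count, as windowed attention needs
```

The fixed number of output clusters is the point here: it keeps the per-window token count constant, which is the constraint the blurb attributes to Swin's window-based attention.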
DisTime is a lightweight framework designed to enhance temporal comprehension in Video-LLMs. It employs a learnable token to create a continuous temporal embedding space, and achieves state-of-the-art performance on benchmarks for three time-sensitive tasks.
arXiv Detail & Related papers (2025-05-30T08:10:18Z)
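A minimal sketch of the distribution-based idea described above, under the assumption that a learnable time token's hidden state is decoded into a distribution over discrete timestamps whose expectation gives a continuous time estimate. The class name, bin count, and expectation decoding are illustrative assumptions, not the official DisTime implementation.

```python
import torch
import torch.nn as nn

class DistributionTimeHead(nn.Module):
    """Map a time token's hidden state to a distribution over N timestamps,
    then take its expectation as a continuous time estimate (sketch only)."""
    def __init__(self, hidden=768, n_bins=100):
        super().__init__()
        # Learnable token embedding that would be inserted into the LLM's
        # input sequence where a timestamp is requested (unused in this
        # standalone demo, shown only to mirror the blurb).
        self.time_token = nn.Parameter(torch.randn(hidden))
        self.to_bins = nn.Linear(hidden, n_bins)
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, n_bins))

    def forward(self, h_time):
        # h_time: (B, hidden) hidden state at the time token position
        p = self.to_bins(h_time).softmax(-1)  # (B, n_bins) distribution over time
        t = (p * self.bin_centers).sum(-1)    # expected normalized timestamp in [0, 1]
        return t, p

head = DistributionTimeHead()
h = torch.randn(2, 768)        # stand-in for LLM hidden states
t, p = head(h)
print(t.shape, p.shape)        # torch.Size([2]) torch.Size([2, 100])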
- InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO [73.33751812982342]
InfLVG is an inference-time framework that enables coherent long video generation without requiring additional long-form video data. We show that InfLVG can extend video length by up to 9x while achieving strong consistency and semantic fidelity across scenes.
arXiv Detail & Related papers (2025-05-23T07:33:25Z)
- Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation [20.67434288227437]
ViLAMP is a hierarchical video-language model that processes hour-long videos at "mixed precision": it retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU.
arXiv Detail & Related papers (2025-04-03T09:55:09Z)
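A minimal sketch of the "mixed precision" idea described above: keyframes keep all tokens while other frames are reduced to a few salient ones. The keyframe schedule and norm-based saliency are illustrative stand-ins, not ViLAMP's actual keyframe selection or distillation.

```python
import numpy as np

def mixed_precision_frames(frames, keyframe_every=16, salient_k=4):
    """Keep all tokens for keyframes and only the top-k most salient tokens
    for the rest (hypothetical names; saliency here is just token norm)."""
    out = []
    for i, tokens in enumerate(frames):           # tokens: (T, d)
        if i % keyframe_every == 0:
            out.append(tokens)                    # keyframe: full information
        else:
            saliency = np.linalg.norm(tokens, axis=1)
            top = np.argsort(saliency)[-salient_k:]
            out.append(tokens[np.sort(top)])      # non-keyframe: salient subset
    return out

rng = np.random.default_rng(2)
video = [rng.normal(size=(196, 64)) for _ in range(32)]  # 32 frames of ViT patch tokens
reduced = mixed_precision_frames(video)
print(sum(map(len, reduced)), "tokens instead of", 196 * 32)  # 512 vs 6272
```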
- TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos [50.04992164981131]
Temporal localization in untrimmed videos is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. We propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks.
arXiv Detail & Related papers (2025-03-09T09:11:26Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
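To make the architectural position of such a module concrete, here is a minimal sketch of a temporal encoder sitting between an image encoder and an LLM: it mixes information across frames, then pools temporally to shrink the token count. The GRU mixer, pooling factor, and class name are illustrative assumptions, not STORM's actual design.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Mix information across frames, then pool frames to reduce tokens
    (a sketch of the module's role, not the paper's architecture)."""
    def __init__(self, dim=768, pool=4):
        super().__init__()
        self.mix = nn.GRU(dim, dim, batch_first=True)  # stand-in temporal mixer
        self.pool = nn.AvgPool1d(pool)

    def forward(self, x):
        # x: (B, frames, tokens_per_frame, dim) from the image encoder
        b, f, t, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)   # one sequence per patch position
        x, _ = self.mix(x)                                # propagate info across frames
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (b*t, f/pool, d)
        return x.reshape(b, t, -1, d).permute(0, 2, 1, 3) # (B, f/pool, tokens, dim)

enc = TemporalEncoder()
video_tokens = torch.randn(1, 16, 49, 768)  # 16 frames x 49 patch tokens
print(enc(video_tokens).shape)              # torch.Size([1, 4, 49, 768]) -> fed to the LLM
```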
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
With it, the LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
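The TESTA entry above describes aggregating similar frames (and similar patches within frames). A minimal sketch of the temporal half of that idea, assuming a greedy cosine-similarity merge of adjacent frames; the threshold, the mean-token comparison, and the function name are illustrative assumptions, whereas TESTA's actual module is learned and also aggregates patches spatially.

```python
import numpy as np

def aggregate_similar_frames(frames, threshold=0.9):
    """Greedily merge each frame into the previous group when their
    mean-token cosine similarity exceeds a threshold (sketch only)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    groups = [[frames[0]]]
    for f in frames[1:]:
        if cos(groups[-1][-1].mean(0), f.mean(0)) > threshold:
            groups[-1].append(f)               # similar: join current group
        else:
            groups.append([f])                 # dissimilar: start a new group
    # Aggregate each group into a single "frame" by averaging its tokens
    return [np.mean(g, axis=0) for g in groups]

rng = np.random.default_rng(3)
base = rng.normal(size=(49, 64))
video = [base + 0.01 * rng.normal(size=(49, 64)) for _ in range(8)]  # near-duplicates
video += [rng.normal(size=(49, 64))]                                  # a scene change
print(len(aggregate_similar_frames(video)))  # 2: far fewer "frames" than input
```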
This list is automatically generated from the titles and abstracts of the papers on this site.