Dense Video Understanding with Gated Residual Tokenization
- URL: http://arxiv.org/abs/2509.14199v2
- Date: Thu, 18 Sep 2025 13:17:10 GMT
- Title: Dense Video Understanding with Gated Residual Tokenization
- Authors: Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu
- Abstract summary: High temporal resolution is essential for capturing fine-grained details in video understanding. Current benchmarks rely mostly on low-frame-rate sampling. Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
- Score: 49.17263029080152
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
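The gating idea in stage (1) of the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it swaps motion-compensated estimation for plain frame differencing, and the function names (`patch_gate`, `count_tokens`), the 16-pixel patch size, and the threshold are all hypothetical choices made for illustration. It only shows why skipping static patches makes the token count grow more slowly than tokenizing every frame in full.

```python
import numpy as np


def patch_gate(prev, curr, patch=16, thresh=4.0):
    """Return a boolean grid of shape (H // patch, W // patch).

    True marks a patch whose mean absolute pixel change exceeds `thresh`,
    i.e. a patch that would be re-tokenized. Simple frame differencing is
    used here as a stand-in for motion estimation (an assumption).
    """
    h, w = curr.shape[:2]
    gh, gw = h // patch, w // patch
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    if diff.ndim == 3:
        # Average over colour channels so each pixel has one change value.
        diff = diff.mean(axis=2)
    # Crop to a whole number of patches, then average inside each patch.
    diff = diff[: gh * patch, : gw * patch]
    per_patch = diff.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    return per_patch > thresh


def count_tokens(frames, patch=16, thresh=4.0):
    """Count tokens when only changed patches are tokenized after frame 0."""
    first = frames[0]
    total = (first.shape[0] // patch) * (first.shape[1] // patch)
    for prev, curr in zip(frames[:-1], frames[1:]):
        total += int(patch_gate(prev, curr, patch, thresh).sum())
    return total


# Toy clip: three 64x64 grayscale frames; one patch changes in frame 1.
f0 = np.zeros((64, 64), dtype=np.uint8)
f1 = f0.copy()
f1[:16, :16] = 255
f2 = f1.copy()
frames = [f0, f1, f2]

gated = count_tokens(frames)
dense = len(frames) * (64 // 16) * (64 // 16)
print(gated, dense)
```

On this toy clip the gated count is 17 tokens (16 for the first frame plus one changed patch) against 48 for dense tokenization. Stage (2), the merging of tokens across static regions within a scene, is not covered by this sketch.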
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more granular temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z)
- SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM [36.28285195488772]
Large language models (LLMs) have demonstrated exceptional capabilities in text understanding. Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information. This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding.
arXiv Detail & Related papers (2026-02-03T14:39:16Z)
- TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding [14.570869250170139]
TV-RAG is a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning.
arXiv Detail & Related papers (2025-12-29T14:10:22Z)
- Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference [5.146388234814547]
Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences.
arXiv Detail & Related papers (2025-10-16T12:34:38Z)
- ResidualViT for Efficient Temporally Dense Video Encoding [66.57779133786131]
We make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model.
arXiv Detail & Related papers (2025-09-16T17:12:23Z)
- When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding [12.410012029024342]
We present Grounded VideoDiT, a Video LLM designed to overcome limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal timestamp tokens provides explicit modeling, enabling fine-grained temporal reasoning.
arXiv Detail & Related papers (2025-08-21T15:12:14Z)
- KFFocus: Highlighting Keyframes for Enhanced Video Understanding [33.69757683688046]
We propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. We also introduce a multimodal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame.
arXiv Detail & Related papers (2025-08-12T14:57:03Z)
- Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
- DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation [61.59996525424585]
DIFFVSGG is an online VSGG solution that frames this task as an iterative scene graph update problem. We unify the decoding of three tasks (object classification, bounding box regression, and graph generation) using one shared feature embedding. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs.
arXiv Detail & Related papers (2025-03-18T06:49:51Z)
- VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for Video Reasoning rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)