DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
- URL: http://arxiv.org/abs/2505.16376v1
- Date: Thu, 22 May 2025 08:29:57 GMT
- Title: DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
- Authors: Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen,
- Abstract summary: Long Video Temporal Grounding aims at identifying specific moments within lengthy videos based on user-provided text queries. Existing methods divide the video into clips and process each clip via a full-scale expert encoder, which is challenging to scale. We introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computation efficiency without sacrificing grounding performance.
- Score: 31.42088612166144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to the prohibitive computational cost of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from the sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.
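The abstract describes the delegate-and-conquer flow only in prose: a lightweight sidekick encoder scores every clip against the query, and only the most salient clips are delegated to the full-scale expert encoder. The PyTorch-style sketch below illustrates that selection step under stated assumptions; the module names (sidekick, expert, text_encoder), the cosine-similarity saliency, and the top_k parameter are illustrative placeholders rather than the paper's actual interfaces, and DeCaf-Grounder's fusion step is omitted.

```python
# Minimal sketch (not the authors' implementation) of the "delegate-and-conquer"
# idea from the abstract: a cheap sidekick encoder scores every clip against the
# text query, and only the top-scoring clips are re-encoded by the expert encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DelegateAndConquer(nn.Module):
    def __init__(self, sidekick: nn.Module, expert: nn.Module,
                 text_encoder: nn.Module, top_k: int = 16):
        super().__init__()
        self.sidekick = sidekick          # lightweight encoder, runs on every clip
        self.expert = expert              # full-scale encoder, runs on selected clips only
        self.text_encoder = text_encoder  # encodes the user query
        self.top_k = top_k                # number of clips delegated to the expert (assumed)

    def forward(self, clips: torch.Tensor, query_tokens: torch.Tensor):
        # clips: (num_clips, C, T, H, W); query_tokens: tokenized text query
        q = F.normalize(self.text_encoder(query_tokens), dim=-1)   # (D,)
        dense = F.normalize(self.sidekick(clips), dim=-1)          # (num_clips, D)

        # Saliency map: cosine similarity between every clip and the query.
        saliency = dense @ q                                       # (num_clips,)

        # Delegate only the most salient clips to the expensive expert encoder.
        k = min(self.top_k, clips.size(0))
        top_idx = saliency.topk(k).indices
        sparse = self.expert(clips[top_idx])                       # (k, D)

        # Downstream, a grounder (DeCaf-Grounder in the paper) would fuse the
        # dense sidekick features and the sparse expert features to predict the
        # start/end of the queried moment; that fusion is omitted here.
        return dense, sparse, top_idx, saliency
```

The computational saving in this sketch comes from running the expert encoder on only k of the N clips, while the dense sidekick features remain available to the grounder at the finer temporal resolution.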
Related papers
- Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering [36.94781787191615]
We propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. Experiments on two large-scale video datasets demonstrate the superiority and efficiency of our AMDNet.
arXiv Detail & Related papers (2025-04-15T07:00:18Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Parameter-free Video Segmentation for Vision and Language Understanding [55.20132267309382]
We propose an algorithm for segmenting videos into contiguous chunks, based on the minimum description length principle. Given feature vectors, the algorithm is entirely parameter-free, requiring no preset threshold and no specification of the number or size of chunks.
arXiv Detail & Related papers (2025-03-03T05:54:37Z) - SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedup of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z) - RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos [16.916873537450424]
Existing methods typically operate in two stages: clip retrieval and grounding.
We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity.
RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D.
arXiv Detail & Related papers (2023-12-11T09:12:35Z) - Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos [60.86880787242561]
Video temporal grounding aims to pinpoint a video segment that matches the query description.
We propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution.
Our method significantly outperforms the state of the art, achieving 14.6x / 102.8x higher efficiency respectively.
arXiv Detail & Related papers (2023-03-15T03:54:43Z) - Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework; one of its two advantages over previous works is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound [103.28102473127748]
We introduce an audiovisual method for long-range text-to-video retrieval.
Our approach aims to retrieve minute-long videos that capture complex human actions.
Our method is 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches.
arXiv Detail & Related papers (2022-04-06T14:43:42Z) - Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization [96.73647162960842]
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding.
Existing TAL methods rely on pre-training a video encoder through action classification supervision.
We introduce a novel low-fidelity end-to-end (LoFi) video encoder pre-training method.
arXiv Detail & Related papers (2021-03-28T22:18:14Z)