Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding
in Long Videos
- URL: http://arxiv.org/abs/2303.08345v2
- Date: Wed, 22 Mar 2023 12:41:03 GMT
- Title: Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding
in Long Videos
- Authors: Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin
Peng, Deli Zhao
- Abstract summary: Video temporal grounding aims to pinpoint a video segment that matches the query description.
We propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution.
Our method significantly outperforms state-of-the-art methods and achieves 14.6x / 102.8x higher efficiency on MAD and Ego4d, respectively.
- Score: 60.86880787242561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video temporal grounding aims to pinpoint a video segment that matches the
query description. Despite recent advances in short-form videos
(\textit{e.g.}, in minutes), temporal grounding in long videos (\textit{e.g.},
in hours) is still in its early stage. To address this challenge, a common
practice is to employ a sliding window, yet this can be inefficient and inflexible
due to the limited number of frames within the window. In this work, we propose
an end-to-end framework for fast temporal grounding, which is able to model an
hours-long video with \textbf{one-time} network execution. Our pipeline is
formulated in a coarse-to-fine manner, where we first extract context knowledge
from non-overlapped video clips (\textit{i.e.}, anchors), and then supplement
the anchors that respond strongly to the query with detailed content knowledge.
Besides the remarkably high pipeline efficiency, another advantage of our
approach is the capability of capturing long-range temporal correlation, thanks
to modeling the entire video as a whole, which facilitates more accurate
grounding. Experimental results suggest that, on the long-form video datasets
MAD and Ego4d, our method significantly outperforms state-of-the-art methods and
achieves \textbf{14.6$\times$} / \textbf{102.8$\times$} higher efficiency,
respectively. The project can be found at
\url{https://github.com/afcedf/SOONet.git}.
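
To make the coarse-to-fine, one-pass idea concrete, below is a minimal sketch of how such a pipeline could be organized: non-overlapping anchor clips are summarized and scored against the query in a single forward pass over the whole video, and only the top-scoring anchors are revisited at frame level. This is an illustrative approximation, not the official SOONet implementation; the module choices, tensor shapes, `anchor_len`, and `top_k` are assumptions.

```python
# Minimal sketch of one-pass, coarse-to-fine temporal grounding.
# Illustrative only, NOT the official SOONet code: module names, shapes,
# and the refinement strategy are assumptions.
import torch
import torch.nn as nn


class CoarseToFineGrounder(nn.Module):
    def __init__(self, dim=512, anchor_len=32, top_k=5):
        super().__init__()
        self.anchor_len = anchor_len          # frames per non-overlapping anchor clip
        self.top_k = top_k                    # anchors kept for fine-grained refinement
        self.coarse_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.fine_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.boundary_head = nn.Linear(dim, 2)  # predicts (start, end) offsets per anchor

    def forward(self, frame_feats, query_feat):
        # frame_feats: (num_frames, dim) pre-extracted features for the whole long video
        # query_feat:  (dim,) sentence embedding of the text query
        n, d = frame_feats.shape
        n_anchors = n // self.anchor_len
        anchors = frame_feats[: n_anchors * self.anchor_len].reshape(
            n_anchors, self.anchor_len, d
        )

        # Coarse stage: one pass over anchor-level summaries of the entire video,
        # so long-range temporal context is modeled jointly rather than per window.
        anchor_tokens = anchors.mean(dim=1).unsqueeze(0)            # (1, n_anchors, d)
        anchor_ctx = self.coarse_encoder(anchor_tokens).squeeze(0)  # (n_anchors, d)
        scores = anchor_ctx @ query_feat                            # query-anchor relevance

        # Fine stage: revisit only the most query-relevant anchors at frame level.
        top_idx = scores.topk(min(self.top_k, n_anchors)).indices
        fine_tokens = self.fine_encoder(anchors[top_idx])           # (top_k, anchor_len, d)
        boundaries = self.boundary_head(fine_tokens.mean(dim=1))    # (top_k, 2)
        return top_idx, scores[top_idx], boundaries
```

Because every anchor is encoded jointly in the single forward pass, long-range temporal correlations across the hours-long video are available when ranking anchors, which is what removes the need for a sliding window.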
Related papers
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
- Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length.
arXiv Detail & Related papers (2024-05-30T06:10:10Z)
- Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
- LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15T02:39:25Z)
- How Much Temporal Long-Term Context is Needed for Action Segmentation? [16.89998201009075]
We introduce a transformer-based model that leverages sparse attention to capture the full context of a video.
Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
arXiv Detail & Related papers (2023-08-22T11:20:40Z)
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation [157.07019458623242]
NUWA-XL is a novel Diffusion over Diffusion architecture for eXtremely Long video generation.
Our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity.
Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s.
arXiv Detail & Related papers (2023-03-22T07:10:09Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates a neural model to learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
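
As a small illustration of the sparse-sampling idea in the ClipBERT entry above, the snippet below draws a handful of short clips from a long video at each training step instead of consuming dense offline-extracted features. It is only a hedged sketch; the clip length, number of clips, and function name are assumptions rather than ClipBERT's actual API.

```python
# Illustrative sparse sampling in the spirit of ClipBERT's "less is more" idea:
# at each training step, only a few short clips are drawn from the full video.
# Not the official ClipBERT code; parameters below are assumed for illustration.
import random


def sample_sparse_clips(num_frames, clips_per_step=2, frames_per_clip=4):
    """Return frame indices for a handful of short, randomly placed clips."""
    clips = []
    for _ in range(clips_per_step):
        start = random.randint(0, max(0, num_frames - frames_per_clip))
        clips.append(list(range(start, min(num_frames, start + frames_per_clip))))
    return clips


# Example: a 3,000-frame video contributes only 8 frames to this training step.
print(sample_sparse_clips(num_frames=3000))
```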