Faster Video Moment Retrieval with Point-Level Supervision
- URL: http://arxiv.org/abs/2305.14017v1
- Date: Tue, 23 May 2023 12:53:50 GMT
- Title: Faster Video Moment Retrieval with Point-Level Supervision
- Authors: Xun Jiang, Zailei Zhou, Xing Xu, Yang Yang, Guoqing Wang, Heng Tao Shen
- Abstract summary: Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an untrimmed video with natural language queries.
Existing VMR methods suffer from two defects: massive expensive temporal annotations and complicated cross-modal interaction modules.
We propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR).
- Score: 70.51822333023145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Moment Retrieval (VMR) aims at retrieving the most relevant events from
an untrimmed video with natural language queries. Existing VMR methods suffer
from two defects: (1) massive expensive temporal annotations are required to
obtain satisfying performance; (2) complicated cross-modal interaction modules
are deployed, which lead to high computational cost and low efficiency for the
retrieval process. To address these issues, we propose a novel method termed
Cheaper and Faster Moment Retrieval (CFMR), which effectively balances retrieval
accuracy, efficiency, and annotation cost for VMR. Specifically, our proposed
CFMR method learns from point-level supervision where each annotation is a
single frame randomly located within the target moment. It is 6 times cheaper
than the conventional annotations of event boundaries. Furthermore, we also
design a concept-based multimodal alignment mechanism to bypass the usage of
cross-modal interaction modules during the inference process, remarkably
improving retrieval efficiency. The experimental results on three widely used
VMR benchmarks demonstrate that the proposed CFMR method establishes a new
state-of-the-art with point-level supervision. Moreover, it significantly
accelerates retrieval, requiring more than 100 times fewer FLOPs than existing
approaches with point-level supervision.
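The two mechanisms the abstract describes, point-level labels and interaction-free retrieval, can be illustrated with a short sketch. The snippet below is based only on the abstract and is not the authors' implementation: `sample_point_annotation` shows that a point-level label is just one timestamp drawn from inside the target moment, and `retrieve_moment` shows why retrieval becomes cheap once clip embeddings are precomputed and no cross-modal interaction module runs at inference. All function names, shapes, and the cosine-similarity scoring are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch based on the CFMR abstract; names and scoring are
# assumptions for illustration, not the paper's actual implementation.

def sample_point_annotation(start_sec: float, end_sec: float, rng=None) -> float:
    """Point-level supervision: instead of labeling both event boundaries,
    the annotator marks a single frame, modeled here as one timestamp
    drawn uniformly at random from inside the target moment."""
    rng = rng or np.random.default_rng()
    return float(rng.uniform(start_sec, end_sec))

def retrieve_moment(query_emb: np.ndarray, clip_embs: np.ndarray) -> int:
    """Interaction-free inference: clip embeddings are precomputed offline,
    so answering a query is a similarity lookup rather than a pass through
    a cross-modal interaction module."""
    q = query_emb / np.linalg.norm(query_emb)                         # normalize query
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)  # normalize clips
    return int(np.argmax(c @ q))                                      # best-matching clip index
```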
Related papers
- Fast Online Video Super-Resolution with Deformable Attention Pyramid [172.16491820970646]
Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.
We propose a recurrent VSR architecture based on a deformable attention pyramid (DAP).
arXiv Detail & Related papers (2022-02-03T17:49:04Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21, and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
- Video Face Super-Resolution with Motion-Adaptive Feedback Cell [90.73821618795512]
Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNNs).
In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block, which can efficiently capture motion compensation and feed it back to the network in an adaptive way.
arXiv Detail & Related papers (2020-02-15T13:14:10Z)