Faster Video Moment Retrieval with Point-Level Supervision
- URL: http://arxiv.org/abs/2305.14017v1
- Date: Tue, 23 May 2023 12:53:50 GMT
- Title: Faster Video Moment Retrieval with Point-Level Supervision
- Authors: Xun Jiang, Zailei Zhou, Xing Xu, Yang Yang, Guoqing Wang, Heng Tao Shen
- Abstract summary: Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an untrimmed video with natural language queries.
Existing VMR methods suffer from two defects: massive expensive temporal annotations and complicated cross-modal interaction modules.
We propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR).
- Score: 70.51822333023145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Moment Retrieval (VMR) aims at retrieving the most relevant events from
an untrimmed video with natural language queries. Existing VMR methods suffer
from two defects: (1) massive expensive temporal annotations are required to
obtain satisfying performance; (2) complicated cross-modal interaction modules
are deployed, which lead to high computational cost and low efficiency for the
retrieval process. To address these issues, we propose a novel method termed
Cheaper and Faster Moment Retrieval (CFMR), which effectively balances retrieval
accuracy, efficiency, and annotation cost for VMR. Specifically, our proposed
CFMR method learns from point-level supervision where each annotation is a
single frame randomly located within the target moment. It is 6 times cheaper
than the conventional annotations of event boundaries. Furthermore, we also
design a concept-based multimodal alignment mechanism to bypass the usage of
cross-modal interaction modules during the inference process, remarkably
improving retrieval efficiency. The experimental results on three widely used
VMR benchmarks demonstrate that the proposed CFMR method establishes a new
state-of-the-art with point-level supervision. Moreover, it significantly
accelerates retrieval, requiring more than 100 times fewer FLOPs than existing
approaches with point-level supervision.
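The two mechanisms the abstract describes, point-level labels and interaction-free retrieval, can be illustrated with a short sketch. The snippet below is based only on the abstract and is not the authors' implementation: `sample_point_annotation` shows that a point-level label is just one timestamp drawn from inside the target moment, and `retrieve_moment` shows why retrieval becomes cheap once clip embeddings are precomputed and no cross-modal interaction module runs at inference. All function names, shapes, and the cosine-similarity scoring are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch based on the CFMR abstract; names and scoring are
# assumptions for illustration, not the paper's actual implementation.

def sample_point_annotation(start_sec: float, end_sec: float, rng=None) -> float:
    """Point-level supervision: instead of labeling both event boundaries,
    the annotator marks a single frame, modeled here as one timestamp
    drawn uniformly at random from inside the target moment."""
    rng = rng or np.random.default_rng()
    return float(rng.uniform(start_sec, end_sec))

def retrieve_moment(query_emb: np.ndarray, clip_embs: np.ndarray) -> int:
    """Interaction-free inference: clip embeddings are precomputed offline,
    so answering a query is a similarity lookup rather than a pass through
    a cross-modal interaction module."""
    q = query_emb / np.linalg.norm(query_emb)                         # normalize query
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)  # normalize clips
    return int(np.argmax(c @ q))                                      # best-matching clip index
```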
Related papers
- Fast Online Video Super-Resolution with Deformable Attention Pyramid [172.16491820970646]
Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.
We propose a recurrent VSR architecture based on a deformable attention pyramid (DAP).
arXiv Detail & Related papers (2022-02-03T17:49:04Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21, and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
- Video Face Super-Resolution with Motion-Adaptive Feedback Cell [90.73821618795512]
Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNNs).
In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block, which can efficiently capture motion compensation and feed it back to the network in an adaptive way.
arXiv Detail & Related papers (2020-02-15T13:14:10Z)