Event-aware Video Corpus Moment Retrieval
- URL: http://arxiv.org/abs/2402.13566v1
- Date: Wed, 21 Feb 2024 06:55:20 GMT
- Title: Event-aware Video Corpus Moment Retrieval
- Authors: Danyang Hou and Liang Pang and Huawei Shen and Xueqi Cheng
- Abstract summary: Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
- Score: 79.48249428428802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task
focused on identifying a specific moment within a vast corpus of untrimmed
videos using a natural language query. Existing methods for VCMR typically
rely on frame-aware video retrieval, calculating similarities between the query
and video frames to rank videos based on maximum frame similarity. However, this
approach overlooks the semantic structure embedded within the information
between frames, namely, the event, a crucial element for human comprehension of
videos. Motivated by this, we propose EventFormer, a model that explicitly
utilizes events within videos as fundamental units for video retrieval. The
model extracts event representations through event reasoning and hierarchical
event encoding. The event reasoning module groups consecutive and visually
similar frame representations into events, while the hierarchical event
encoding encodes information at both the frame and event levels. We also
introduce anchor multi-head self-attention to encourage the Transformer to capture
the relevance of adjacent content in the video. The training of EventFormer is
conducted by two-branch contrastive learning and dual optimization for two
sub-tasks of VCMR. Extensive experiments on TVR, ANetCaps, and DiDeMo
benchmarks show the effectiveness and efficiency of EventFormer in VCMR,
achieving new state-of-the-art results. The effectiveness of EventFormer is
also validated on the partially relevant video retrieval task.
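As an illustration of the event reasoning step described in the abstract, the sketch below groups consecutive, visually similar frame embeddings into event representations. The cosine-similarity threshold, the greedy grouping, and the mean pooling are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of event reasoning: group consecutive, visually similar
# frame embeddings into events. Threshold and pooling are assumptions.
import torch
import torch.nn.functional as F

def group_frames_into_events(frames: torch.Tensor, sim_threshold: float = 0.8):
    """frames: (num_frames, dim) embeddings; returns (num_events, dim)."""
    events, current = [], [frames[0]]
    for prev, cur in zip(frames[:-1], frames[1:]):
        sim = F.cosine_similarity(prev, cur, dim=0)
        if sim >= sim_threshold:   # visually similar: extend the current event
            current.append(cur)
        else:                      # similarity drop: close the event
            events.append(torch.stack(current).mean(dim=0))
            current = [cur]
    events.append(torch.stack(current).mean(dim=0))
    return torch.stack(events)

# Example: 10 random frame embeddings of dimension 256
frames = torch.randn(10, 256)
print(group_frames_into_events(frames).shape)
```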
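The anchor multi-head self-attention is described only as encouraging the Transformer to capture the relevance of adjacent content. One plausible reading, sketched below under that assumption, is standard multi-head self-attention with an additive mask that restricts each frame to a local window of neighbors; the paper's exact anchor mechanism may differ.

```python
# A sketch of one reading of "anchor multi-head self-attention": standard
# self-attention plus an additive mask that keeps attention local. The
# window-based mask is an assumption, not the paper's formulation.
import torch
import torch.nn as nn

def local_window_mask(seq_len: int, window: int = 2) -> torch.Tensor:
    """Additive mask: 0 inside +/- window positions, -inf outside."""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    mask = torch.zeros(seq_len, seq_len)
    mask[dist > window] = float("-inf")
    return mask

seq_len, dim = 16, 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
x = torch.randn(1, seq_len, dim)                      # (batch, frames, dim)
out, _ = attn(x, x, x, attn_mask=local_window_mask(seq_len))
print(out.shape)                                      # torch.Size([1, 16, 256])
```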
Related papers
- EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior video event awareness.
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment.
arXiv Detail & Related papers (2024-07-10T09:09:58Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- Video Imprint [107.1365846180187]
A new unified video analytics framework (ER3) is proposed for complex event retrieval, recognition and recounting.
The proposed video imprint representation exploits temporal correlations among image features across video frames.
The video imprint is fed into a reasoning network and a feature aggregation module, for event recognition/recounting and event retrieval tasks, respectively.
arXiv Detail & Related papers (2021-06-07T00:32:47Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning (a dual-encoder contrastive sketch follows this list).
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
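Both EventFormer's two-branch contrastive training and ReLoCLNet's separate text/video encoding rely on contrastive learning over query-video pairs. The sketch below shows a generic symmetric InfoNCE loss for a dual-encoder setup; the batch construction and temperature are placeholder assumptions, not details from either paper.

```python
# A minimal dual-encoder contrastive-learning sketch: matched query-video
# pairs on the diagonal are positives, all other pairs are negatives.
# Encoder outputs are stand-ins; temperature is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def info_nce(query_emb, video_emb, temperature: float = 0.07):
    """query_emb, video_emb: (batch, dim); diagonal pairs are positives."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = q @ v.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))
    # Symmetric loss: query-to-video and video-to-query directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example with random stand-ins for encoder outputs
loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```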