Semantic Video Moments Retrieval at Scale: A New Task and a Baseline
- URL: http://arxiv.org/abs/2210.08389v1
- Date: Sat, 15 Oct 2022 22:46:22 GMT
- Title: Semantic Video Moments Retrieval at Scale: A New Task and a Baseline
- Authors: Na Li
- Abstract summary: Semantic Video Moments Retrieval at scale (SVMR) aims at finding relevant videos and re-localizing the video clips in them.
To address these challenges, we propose our two-stage baseline solution of candidate video retrieval followed by a novel attention-based query-reference semantic alignment framework.
- Score: 6.997674465889922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the increasing need to save search effort by
obtaining relevant video clips instead of whole videos, we propose a new
task, named Semantic Video Moments Retrieval at scale (SVMR), which aims at
finding relevant videos and re-localizing the target clips within them. Our
task is more challenging than a simple combination of video retrieval and
video re-localization because of several essential aspects. In the first
stage, SVMR must account for the facts that: 1) a positive candidate long
video can contain many irrelevant clips that are nonetheless semantically
meaningful, and 2) a long video can be positive for two totally different
query clips if it contains segments relevant to both. The second,
re-localization stage also departs from the assumption of existing video
re-localization tasks that the reference video must contain segments
semantically similar to the query clip: in our scenario, the retrieved long
video can be a false positive due to the inaccuracy of the first stage. To
address these challenges, we propose a two-stage baseline solution:
candidate video retrieval followed by a novel attention-based
query-reference semantic alignment framework that re-localizes target clips
from candidate videos. Furthermore, we build two more appropriate benchmark
datasets from the off-the-shelf ActivityNet-1.3 and HACS for a thorough
evaluation of SVMR models. Extensive experiments show that our solution
outperforms several reference solutions.
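The abstract describes the two-stage pipeline only at a high level. The sketch below shows one way such a baseline could be structured, assuming cosine-similarity ranking for stage 1 and a cross-attention aligner with an auxiliary relevance score for stage 2; the module names, feature dimensions, and the rejection mechanism are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (not the paper's exact design) of a two-stage SVMR baseline:
# (1) retrieve candidate videos by embedding similarity, then
# (2) re-localize the target clip with query-reference cross-attention,
# also predicting a relevance score so false-positive candidates from
# stage 1 can be rejected. Dimensions and heads are arbitrary assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryReferenceAligner(nn.Module):
    """Attention-based alignment of a query clip against a reference video."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.boundary_head = nn.Linear(dim, 2)   # per-frame start/end logits
        self.relevance_head = nn.Linear(dim, 1)  # video-level relevance score

    def forward(self, query_feats, ref_feats):
        # query_feats: (B, Tq, D) query-clip features
        # ref_feats:   (B, Tr, D) reference-video features
        # Each reference frame attends to the query clip.
        aligned, _ = self.cross_attn(ref_feats, query_feats, query_feats)
        boundary_logits = self.boundary_head(aligned)          # (B, Tr, 2)
        relevance = self.relevance_head(aligned.mean(dim=1))   # (B, 1)
        return boundary_logits, torch.sigmoid(relevance).squeeze(-1)


def retrieve_candidates(query_emb, video_embs, top_k=10):
    """Stage 1: rank videos by cosine similarity to the query-clip embedding."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(video_embs, dim=-1).T
    return sims.topk(top_k, dim=-1).indices  # (B, top_k) candidate video ids


if __name__ == "__main__":
    B, Tq, Tr, D = 2, 16, 128, 512
    aligner = QueryReferenceAligner(dim=D)
    boundaries, rel = aligner(torch.randn(B, Tq, D), torch.randn(B, Tr, D))
    # Candidates whose relevance falls below a threshold would be treated as
    # false positives from stage 1 and skipped during re-localization.
    print(boundaries.shape, rel)
```

In this sketch, thresholding the relevance score is what lets the second stage handle candidates that contain no true match, which is the key departure from standard video re-localization noted in the abstract.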
Related papers
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-02T20:00:49Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Partially Relevant Video Retrieval [39.747235541498135]
We propose a novel text-to-video retrieval (T2VR) subtask termed Partially Relevant Video Retrieval (PRVR).
PRVR aims to retrieve partially relevant videos from a large collection of untrimmed videos.
We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames.
arXiv Detail & Related papers (2022-08-26T09:07:16Z) - CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.