Video Corpus Moment Retrieval with Contrastive Learning
- URL: http://arxiv.org/abs/2105.06247v1
- Date: Thu, 13 May 2021 12:54:39 GMT
- Title: Video Corpus Moment Retrieval with Contrastive Learning
- Authors: Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi
Zhou, Rick Siow Mong Goh
- Abstract summary: Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
- Score: 56.249924768243375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a collection of untrimmed and unsegmented videos, the task of video
corpus moment retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video)
that semantically corresponds to a given text query. As video and text are from
two distinct feature spaces, there are two general approaches to address VCMR:
(i) to encode each modality separately, then align the two modality
representations for query processing, and (ii) to adopt fine-grained
cross-modal interaction to learn multi-modal representations for query
processing. While the second approach often leads to better retrieval accuracy,
the first approach is far more efficient. In this paper, we propose a Retrieval
and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We
adopt the first approach and introduce two contrastive learning objectives to
refine video encoder and text encoder to learn video and text representations
separately but with better alignment for VCMR. The video contrastive learning
(VideoCL) is to maximize mutual information between query and candidate video
at the video level. The frame contrastive learning (FrameCL) aims to highlight
the moment region corresponding to the query at the frame level, within a video.
Experimental results show that, although ReLoCLNet encodes text and video
separately for efficiency, its retrieval accuracy is comparable with baselines
adopting cross-modal interaction learning.
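The abstract names two contrastive objectives, VideoCL and FrameCL. Below is a minimal PyTorch sketch of how they could be formulated as InfoNCE-style losses under that reading; the function names, tensor shapes, temperature value, and in-batch negative sampling here are illustrative assumptions, not the paper's exact losses.

```python
# Hedged sketch of the two contrastive objectives described in the abstract,
# written as InfoNCE-style losses. Shapes, temperature, and the negative
# sampling scheme are assumptions, not ReLoCLNet's exact formulation.
import torch
import torch.nn.functional as F

def video_cl(query_embs, video_embs, temperature=0.07):
    """VideoCL: maximize agreement between each query and its matching video
    at the video level, treating the other videos in the batch as negatives
    (an InfoNCE-style lower bound on query-video mutual information).

    query_embs: (B, D) one embedding per query
    video_embs: (B, D) video-level embedding of each matching video
    """
    q = F.normalize(query_embs, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    logits = q @ v.t() / temperature                  # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss over the query->video and video->query directions
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def frame_cl(query_emb, frame_embs, moment_mask, temperature=0.07):
    """FrameCL: within one video, pull the frames inside the ground-truth
    moment toward the query and push the remaining frames away.

    query_emb:   (D,)   query embedding
    frame_embs:  (T, D) per-frame embeddings of one candidate video
    moment_mask: (T,)   1 for frames inside the annotated moment, else 0
    """
    mask = moment_mask.float()
    q = F.normalize(query_emb, dim=-1)
    f = F.normalize(frame_embs, dim=-1)
    log_probs = F.log_softmax(f @ q / temperature, dim=0)  # (T,)
    # Average negative log-likelihood over the positive (moment) frames
    return -(log_probs * mask).sum() / mask.sum().clamp(min=1.0)
```

On this reading, VideoCL supplies the corpus-level ranking signal while FrameCL sharpens which frames inside a retrieved video correspond to the query, matching the retrieval-then-localization split in the paper's title.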
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z)
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection (sketched below).
In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
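As a rough illustration of what such a first-stage candidate selection could look like, here is a hedged sketch of cosine-similarity ranking over a video corpus; the function name, shapes, and top-k interface are assumptions, not CrossTVR's actual implementation.

```python
# Hypothetical first-stage retrieval: rank corpus videos by cosine similarity
# to the query and keep the top-k candidates for a heavier second stage.
import torch.nn.functional as F

def select_candidates(text_emb, video_embs, k=10):
    """text_emb: (D,) query embedding; video_embs: (N, D) corpus embeddings.
    Returns the indices of the k most similar videos."""
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    return sims.topk(min(k, sims.numel())).indices
```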
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.