CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
- URL: http://arxiv.org/abs/2106.11097v1
- Date: Mon, 21 Jun 2021 13:30:33 GMT
- Title: CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
- Authors: Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
- Abstract summary: We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
- Score: 13.270902407320005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the CLIP2Video network to transfer an image-language pre-training
model to video-text retrieval in an end-to-end manner. Leading approaches in
the domain of video-and-language learning try to distill spatio-temporal
video features and multi-modal interaction between videos and language from
large-scale video-text datasets. In contrast, we leverage a pretrained
image-language model and simplify it into a two-stage framework: co-learning of
image and text, followed by enhancing the temporal relations between video frames
and between video and text, which makes it possible to train on comparatively small datasets.
Specifically, building on the spatial semantics captured by the Contrastive
Language-Image Pretraining (CLIP) model, our model involves a Temporal
Difference Block to capture motion across fine-grained temporal video frames, and a
Temporal Alignment Block to re-align the tokens of video clips and phrases and
enhance the multi-modal correlation. We conduct thorough ablation studies, and
achieve state-of-the-art performance on major text-to-video and video-to-text
retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT,
MSVD and VATEX.
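
The abstract names two additions on top of CLIP's per-frame and per-word features: a Temporal Difference Block (TDB) that injects motion cues derived from adjacent-frame differences, and a Temporal Alignment Block (TAB) that re-aligns video and text tokens before similarity is computed. The sketch below is a minimal Python/PyTorch illustration of that wiring only; the interleaving scheme, number of alignment centers, tensor shapes, and module interfaces are assumptions made for illustration, not the authors' released implementation.

```python
# Illustrative sketch only: shapes, module names, and wiring are assumptions,
# not the CLIP2Video authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifferenceBlock(nn.Module):
    """Injects adjacent-frame difference tokens so a temporal transformer can model motion."""

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.diff_proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) CLIP image features, one vector per sampled frame.
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]           # (B, T-1, D) motion cues
        diff_tokens = self.diff_proj(diffs)
        # Interleave frame tokens with difference tokens: f1, d1, f2, d2, ..., fT.
        b, t, d = frame_feats.shape
        tokens = torch.empty(b, 2 * t - 1, d,
                             device=frame_feats.device, dtype=frame_feats.dtype)
        tokens[:, 0::2] = frame_feats
        tokens[:, 1::2] = diff_tokens
        encoded = self.temporal_encoder(tokens)
        return encoded[:, 0::2]                                    # keep the frame positions


class TemporalAlignmentBlock(nn.Module):
    """Re-aligns video and text tokens to shared centers before similarity is computed."""

    def __init__(self, dim: int, num_centers: int = 5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)

    def align(self, tokens: torch.Tensor) -> torch.Tensor:
        # Soft-assign tokens (B, N, D) to the shared centers, pool, and normalize.
        attn = torch.softmax(tokens @ self.centers.t(), dim=1)     # (B, N, K)
        pooled = attn.transpose(1, 2) @ tokens                     # (B, K, D)
        return F.normalize(pooled.mean(dim=1), dim=-1)             # (B, D)


def retrieval_scores(frame_feats, word_feats, tdb, tab):
    """Cosine similarities between every video and every text in a batch."""
    video_emb = tab.align(tdb(frame_feats))                        # (B, D)
    text_emb = tab.align(word_feats)                               # (B, D)
    return video_emb @ text_emb.t()                                # (B, B) similarity matrix
```

In this reading, retrieval reduces to a cosine-similarity matrix between the aligned video and text embeddings, which is also where a contrastive training objective would be applied.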
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
Correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable to baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)