CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip
Retrieval
- URL: http://arxiv.org/abs/2104.08860v1
- Date: Sun, 18 Apr 2021 13:59:50 GMT
- Title: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip
Retrieval
- Authors: Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan,
Tianrui Li
- Abstract summary: CLIP (Contrastive Language-Image Pre-training) has demonstrated the power of learning visual concepts from web-collected image-text datasets.
We propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
- Score: 31.7091206926183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-text retrieval plays an essential role in multi-modal research and has
been widely used in many real-world web applications. CLIP (Contrastive
Language-Image Pre-training), an image-language pre-training model, has
demonstrated the power of learning visual concepts from web-collected
image-text datasets. In this paper, we propose a CLIP4Clip model to transfer
the knowledge of the CLIP model to video-language retrieval in an end-to-end
manner. Several questions are investigated via empirical studies: 1) Are image
features enough for video-text retrieval? 2) How does post-pretraining on a
large-scale video-text dataset affect the performance? 3) What is a practical
mechanism to model temporal dependency between video frames? 4) How sensitive
is the model to hyper-parameters on the video-text retrieval task? Extensive
experimental results demonstrate that the CLIP4Clip model transferred from
CLIP can achieve SOTA results on various video-text retrieval datasets,
including MSR-VTT, MSVD, and LSMDC.
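Code sketch (not from the paper): a minimal illustration of the simplest similarity calculator that CLIP4Clip investigates, namely encoding sampled frames independently with CLIP's image encoder, mean-pooling them into a single video embedding, and scoring it against the caption embedding by cosine similarity. It assumes the openai/CLIP package; the function name, frame paths, and caption are hypothetical, and frame sampling, fine-tuning, and the paper's sequential and tight similarity variants are omitted.

import glob                 # only needed for the commented example below
import torch
import clip                 # pip install git+https://github.com/openai/CLIP.git
from PIL import Image       # only needed for the commented example below

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_text_similarity(frames, caption):
    """Cosine similarity between a mean-pooled video embedding and a caption."""
    with torch.no_grad():
        # Encode each sampled frame independently with the CLIP image encoder.
        frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_feats = model.encode_image(frame_batch)                  # (T, D)
        frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)

        # Parameter-free temporal aggregation: mean pooling over frames.
        video_feat = frame_feats.mean(dim=0)
        video_feat = video_feat / video_feat.norm()

        # Encode the caption with the CLIP text encoder.
        text_tokens = clip.tokenize([caption]).to(device)
        text_feat = model.encode_text(text_tokens)[0]
        text_feat = text_feat / text_feat.norm()

    return (video_feat @ text_feat).item()

# Example usage (paths and caption are placeholders):
# frames = [Image.open(p) for p in sorted(glob.glob("clip_frames/*.jpg"))]
# print(video_text_similarity(frames, "a man is playing guitar on stage"))

The paper also studies sequential (LSTM/Transformer) and tight cross-modal similarity calculators; the mean-pooling form above is only the parameter-free baseline among them.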
Related papers
- Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP into a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z) - TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval [12.067700655401364]
We propose TeachCLIP with multi-grained teaching to let a CLIP4Clip based student network learn from more advanced yet computationally heavy models.
AFA provides a fine-grained learning (teaching) channel for the student (teacher).
arXiv Detail & Related papers (2023-08-02T15:22:00Z) - ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models [6.073813559982129]
Video retrieval involves retrieving the ground truth video from the video database given a text caption or vice-versa.
We evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DIDEMO.
Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding.
arXiv Detail & Related papers (2023-06-28T20:06:36Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z) - Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and forces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z) - CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.