CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment
- URL: http://arxiv.org/abs/2209.06430v1
- Date: Wed, 14 Sep 2022 05:47:02 GMT
- Title: CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment
- Authors: Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang
Li, Jiebo Luo
- Abstract summary: We propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
- Score: 146.3128011522151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pre-trained image-text models, like CLIP, have demonstrated the strong
power of vision-language representation learned from a large scale of
web-collected image-text data. In light of the well-learned visual features,
some existing works transfer the image representation to the video domain and
achieve good results. However, how to utilize an image-language pre-trained
model (e.g., CLIP) for video-language pre-training (post-pretraining) remains
underexplored. In this paper, we investigate two questions: 1) what factors
hinder post-pretraining CLIP from further improving its performance on
video-language tasks, and 2) how can the impact of these factors be mitigated?
Through a series of comparative experiments and analyses, we find that data
scale and the domain gap between language sources have a large impact.
Motivated by these findings, we propose an Omnisource Cross-modal Learning
method equipped with a Video Proxy mechanism on the basis of CLIP, namely
CLIP-ViP. Extensive results
show that our approach improves the performance of CLIP on video-text retrieval
by a large margin. Our model also achieves SOTA results on a variety of
datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our
code and pre-trained CLIP-ViP models at
https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
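To make the Video Proxy idea concrete, the sketch below shows one way learnable proxy tokens could be prepended to the patch tokens of sampled frames before a CLIP-style vision transformer so that video-level information can be aggregated. The class and parameter names (VideoProxyEmbedding, num_proxies) are illustrative assumptions, not the released CLIP-ViP implementation.

```python
import torch
import torch.nn as nn

class VideoProxyEmbedding(nn.Module):
    """Prepends learnable proxy tokens to the flattened patch tokens of all frames (assumed sketch)."""
    def __init__(self, embed_dim: int = 768, num_proxies: int = 4):
        super().__init__()
        # Learnable proxy tokens shared across all videos (illustrative assumption).
        self.proxies = nn.Parameter(torch.randn(1, num_proxies, embed_dim) * 0.02)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, num_patches, embed_dim),
        # e.g. per-frame patch embeddings from a CLIP-style ViT.
        b, t, p, d = frame_tokens.shape
        patches = frame_tokens.reshape(b, t * p, d)   # flatten frames into one token sequence
        proxies = self.proxies.expand(b, -1, -1)      # one copy of the proxy tokens per video
        # Downstream transformer blocks can let the proxies attend across all frames.
        return torch.cat([proxies, patches], dim=1)

# Example: 2 videos, 8 frames, 49 patches, 768-dim tokens -> (2, 4 + 8*49, 768)
tokens = VideoProxyEmbedding()(torch.randn(2, 8, 49, 768))
```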
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between the visual representation and the preceding context, and 2) generating subsequent text based on the visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
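As a rough illustration of how the two objectives described above could be combined, the following PyTorch-style sketch pairs an in-batch contrastive term between the visual representation and the preceding-context embedding with an autoregressive loss on the subsequent text. All tensor shapes and the weighting factor alpha are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def lcl_style_loss(visual_emb, prev_ctx_emb, next_text_logits, next_text_ids,
                   alpha: float = 0.5, temperature: float = 0.07):
    # visual_emb, prev_ctx_emb: (batch, dim) pooled embeddings of the image and the preceding text.
    v = F.normalize(visual_emb, dim=-1)
    c = F.normalize(prev_ctx_emb, dim=-1)
    logits = v @ c.t() / temperature                       # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # next_text_logits: (batch, seq_len, vocab) predictions for the subsequent text,
    # next_text_ids:    (batch, seq_len) ground-truth token ids.
    generative = F.cross_entropy(next_text_logits.flatten(0, 1), next_text_ids.flatten())

    # Weighted sum of the two objectives; alpha is an assumed hyperparameter.
    return alpha * contrastive + (1.0 - alpha) * generative
```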
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
- RWKV-CLIP: A Robust Vision-Language Representation Learner [31.501759213619646]
Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks.
We introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags.
We propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs.
arXiv Detail & Related papers (2024-06-11T06:10:46Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
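The pixel-text matching described above can be pictured as scoring every spatial location of a dense feature map against a set of class-prompt text embeddings. The sketch below is a minimal, assumed formulation of such score maps and does not reproduce DenseCLIP's context-aware prompting.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(dense_feats: torch.Tensor, text_embs: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    # dense_feats: (batch, dim, height, width) feature map from the image encoder.
    # text_embs:   (num_classes, dim) embeddings of class-name prompts from the text encoder.
    f = F.normalize(dense_feats, dim=1)
    t = F.normalize(text_embs, dim=-1)
    # Cosine similarity of every pixel with every class -> (batch, num_classes, height, width).
    return torch.einsum("bdhw,kd->bkhw", f, t) / tau

# Example: score_maps = pixel_text_score_maps(torch.randn(2, 512, 32, 32), torch.randn(20, 512))
```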
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, forcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [31.7091206926183]
CLIP (Contrastive Language-Image Pre-training) has demonstrated the power of learning visual concepts from web-collected image-text datasets.
We propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
arXiv Detail & Related papers (2021-04-18T13:59:50Z)
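For readers who want a concrete starting point, the sketch below shows the simplest way such a transfer is commonly set up (e.g., the parameter-free mean-pooling variant studied in CLIP4Clip): frames are encoded with CLIP's image encoder, averaged into a video embedding, and ranked against caption embeddings by cosine similarity. The clip_model interface follows the public CLIP API (encode_image / encode_text); everything else is illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_text_similarity(clip_model, frames: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    # frames:      (num_frames, 3, H, W) preprocessed frames sampled from one video.
    # text_tokens: (num_texts, context_len) tokenized candidate captions.
    frame_embs = F.normalize(clip_model.encode_image(frames).float(), dim=-1)
    video_emb = F.normalize(frame_embs.mean(dim=0, keepdim=True), dim=-1)   # (1, dim) video embedding
    text_embs = F.normalize(clip_model.encode_text(text_tokens).float(), dim=-1)
    return video_emb @ text_embs.t()                                        # (1, num_texts) similarity scores
```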