Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data
- URL: http://arxiv.org/abs/2310.05010v1
- Date: Sun, 8 Oct 2023 04:46:43 GMT
- Title: Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data
- Authors: Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S.
Davis, Yu-Gang Jiang
- Abstract summary: This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
- Score: 102.0069667710562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant results achieved by Contrastive Language-Image
Pretraining (CLIP) in zero-shot image recognition, limited effort has been made
exploring its potential for zero-shot video recognition. This paper presents
Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong
zero-shot video classifier, capable of identifying novel actions and events
during testing. Open-VCLIP++ minimally modifies CLIP to capture
spatial-temporal relationships in videos, thereby creating a specialized video
classifier while striving for generalization. We formally demonstrate that
training Open-VCLIP++ is tantamount to continual learning with zero historical
data. To address this problem, we introduce Interpolated Weight Optimization, a
technique that leverages the advantages of weight interpolation during both
training and testing. Furthermore, we build upon large language models to
produce fine-grained video descriptions. These detailed descriptions are
further aligned with video features, facilitating a better transfer of CLIP to
the video domain. Our approach is evaluated on three widely used action
recognition datasets, following a variety of zero-shot evaluation protocols.
The results demonstrate that our method surpasses existing state-of-the-art
techniques by significant margins. Specifically, we achieve zero-shot accuracy
scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets
respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%,
and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval
dataset, where it delivers competitive video-to-text and text-to-video
retrieval performance, while utilizing substantially less fine-tuning data
compared to other methods. Code is released at
https://github.com/wengzejia1/Open-VCLIP.
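The abstract's central technique, Interpolated Weight Optimization, boils down to blending the original CLIP weights with the video-adapted weights and applying that blend during both training and testing. The sketch below shows only the basic linear blend in PyTorch; `interpolate_weights`, the fixed `alpha`, and the usage comment are illustrative assumptions rather than the released implementation (see the repository linked above for the authors' code).

```python
# Minimal sketch of weight interpolation between the original CLIP weights and a
# video-fine-tuned copy, in the spirit of Interpolated Weight Optimization.
# `alpha` is a hypothetical hyperparameter; the paper's exact schedule (applied
# during both training and testing) is not reproduced here.
from copy import deepcopy

import torch


def interpolate_weights(clip_state, video_state, alpha=0.5):
    """Blend two compatible state dicts:
    theta = (1 - alpha) * theta_clip + alpha * theta_video."""
    merged = deepcopy(video_state)
    for name, w_video in video_state.items():
        w_clip = clip_state[name]
        if torch.is_floating_point(w_video):
            merged[name] = (1.0 - alpha) * w_clip + alpha * w_video
    return merged


# Hypothetical usage: `clip_model` is the original image CLIP and `video_model`
# is the copy fine-tuned on video data with identical parameter names.
# video_model.load_state_dict(
#     interpolate_weights(clip_model.state_dict(), video_model.state_dict(), alpha=0.5)
# )
```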
Related papers
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via
Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on top of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, an efficient framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and forces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate their effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
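A recipe common to several of the CLIP-based papers listed above is zero-shot video classification: frame embeddings from the CLIP image encoder are pooled over time and compared against class-name text embeddings. The sketch below illustrates that baseline with the OpenAI `clip` package; the prompt template, example label set, and mean pooling are simplifying assumptions, not the specific method of any single paper.

```python
# Minimal sketch of zero-shot video classification with CLIP: encode sampled
# frames, mean-pool over time, and score against class-name text embeddings.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["archery", "juggling", "playing guitar"]  # example label set
text_tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)


@torch.no_grad()
def classify_video(frames):
    """`frames` is a list of PIL images sampled from one video clip."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_feats = model.encode_image(images)               # (T, D) per-frame features
    video_feat = frame_feats.mean(dim=0, keepdim=True)     # temporal mean pooling
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = model.encode_text(text_tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * video_feat @ text_feats.T             # scaled cosine similarity
    return logits.softmax(dim=-1).squeeze(0)               # probability per class name
```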