Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via
Interpolated Weight Optimization
- URL: http://arxiv.org/abs/2302.00624v3
- Date: Wed, 31 May 2023 02:54:28 GMT
- Title: Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via
Interpolated Weight Optimization
- Authors: Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
- Score: 82.75718846187685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive
zero-shot learning abilities for image understanding, yet limited effort has
been made to investigate CLIP for zero-shot video recognition. We introduce
Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong
zero-shot video classifier that can recognize unseen actions and events at test
time. Our framework extends CLIP with minimal modifications to model
spatial-temporal relationships in videos, making it a specialized video
classifier, while striving for generalization. We formally show that training
an Open-VCLIP is equivalent to continual learning with zero historical data. To
address this problem, we propose Interpolated Weight Optimization, which
utilizes the benefit of weight interpolation in both training and test time. We
evaluate our method on three popular and challenging action recognition
datasets following various zero-shot evaluation protocols and we demonstrate
our approach outperforms state-of-the-art methods by clear margins. In
particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and
Kinetics-600 respectively, outperforming state-of-the-art methods by 8.3%, 7.8%
and 12.2%. Code is released at https://github.com/wengzejia1/Open-VCLIP.
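At its core, Interpolated Weight Optimization linearly mixes the weights of the fine-tuned video model with those of the original CLIP model, and the abstract notes that this interpolation is used at both training and test time. The snippet below is a minimal sketch of the interpolation step alone, assuming PyTorch state dicts with matching parameter names; the function and variable names are illustrative and not the authors' exact implementation.

```python
# Minimal sketch of weight interpolation between the original CLIP weights and a
# video-fine-tuned copy: theta = (1 - alpha) * theta_clip + alpha * theta_finetuned.
# Assumes both state dicts share parameter names; names here are illustrative.
import copy
import torch

def interpolate_weights(clip_state, finetuned_state, alpha=0.5):
    """Linearly mix two PyTorch state dicts with coefficient alpha."""
    mixed = copy.deepcopy(clip_state)
    for name, w_clip in clip_state.items():
        w_ft = finetuned_state[name]
        if torch.is_floating_point(w_clip):
            mixed[name] = (1.0 - alpha) * w_clip + alpha * w_ft
        else:
            # Non-float entries (e.g. integer buffers) are taken from the fine-tuned model.
            mixed[name] = w_ft
    return mixed

# Example use: evaluate a model built from the interpolated weights.
# eval_model.load_state_dict(interpolate_weights(clip_model.state_dict(),
#                                                video_model.state_dict(),
#                                                alpha=0.5))
```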
Related papers
- Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP into a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [111.49781716597984]
We propose a multimodal prompt learning scheme that balances supervised and zero-shot performance within a single unified training framework.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
arXiv Detail & Related papers (2023-04-06T18:00:04Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate the framework's effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
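Several of the papers above build their zero-shot classifiers the same way: embed the class names with CLIP's text encoder, embed sampled video frames with its image encoder, and pick the class with the highest cosine similarity. The sketch below illustrates this common recipe using the openai/CLIP package; the frame-averaging, prompt template, and label list are illustrative assumptions, not any single paper's exact protocol.

```python
# Sketch of CLIP-style zero-shot video classification via frame-feature averaging.
# Uses the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["playing guitar", "riding a bike", "juggling balls"]  # example labels
prompts = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)

def zero_shot_classify(frames):
    """frames: a list of PIL images sampled from one video."""
    with torch.no_grad():
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_feats = model.encode_image(images)              # (T, D) per-frame features
        video_feat = frame_feats.mean(dim=0, keepdim=True)    # simple temporal average
        text_feats = model.encode_text(prompts)               # (C, D) class embeddings
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        scores = video_feat @ text_feats.t()                  # cosine similarities
    return class_names[scores.argmax(dim=-1).item()]
```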
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.