Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data
- URL: http://arxiv.org/abs/2310.05010v1
- Date: Sun, 8 Oct 2023 04:46:43 GMT
- Title: Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data
- Authors: Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S.
Davis, Yu-Gang Jiang
- Abstract summary: This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
- Score: 102.0069667710562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant results achieved by Contrastive Language-Image
Pretraining (CLIP) in zero-shot image recognition, limited effort has been made
to explore its potential for zero-shot video recognition. This paper presents
Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong
zero-shot video classifier, capable of identifying novel actions and events
during testing. Open-VCLIP++ minimally modifies CLIP to capture
spatial-temporal relationships in videos, thereby creating a specialized video
classifier while striving for generalization. We formally demonstrate that
training Open-VCLIP++ is tantamount to continual learning with zero historical
data. To address this problem, we introduce Interpolated Weight Optimization, a
technique that leverages the advantages of weight interpolation during both
training and testing. Furthermore, we build upon large language models to
produce fine-grained video descriptions. These detailed descriptions are
further aligned with video features, facilitating a better transfer of CLIP to
the video domain. Our approach is evaluated on three widely used action
recognition datasets, following a variety of zero-shot evaluation protocols.
The results demonstrate that our method surpasses existing state-of-the-art
techniques by significant margins. Specifically, we achieve zero-shot accuracy
scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets
respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%,
and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval
dataset, where it delivers competitive video-to-text and text-to-video
retrieval performance, while utilizing substantially less fine-tuning data
compared to other methods. Code is released at
https://github.com/wengzejia1/Open-VCLIP.
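The Interpolated Weight Optimization described in the abstract revolves around interpolating the weights of the video-adapted model with the original CLIP weights. The following is a minimal sketch of that weight-space interpolation step in PyTorch; the function name, the stand-in `nn.Linear` modules, and the fixed `alpha` are illustrative assumptions rather than the authors' exact training/test-time schedule (see the released code linked above for the actual implementation).

```python
# Minimal sketch of weight interpolation between pretrained CLIP weights and a
# video-fine-tuned copy, in the spirit of Interpolated Weight Optimization.
# Assumes both state dicts share identical keys and floating-point parameters.
from collections import OrderedDict

import torch


def interpolate_state_dicts(clip_sd, video_sd, alpha):
    """Return (1 - alpha) * pretrained weights + alpha * fine-tuned weights."""
    merged = OrderedDict()
    for name, w_clip in clip_sd.items():
        merged[name] = (1.0 - alpha) * w_clip + alpha * video_sd[name]
    return merged


# Tiny self-contained demo with stand-in modules (not an actual CLIP model):
base = torch.nn.Linear(4, 2)   # plays the role of the frozen pretrained model
tuned = torch.nn.Linear(4, 2)  # plays the role of the video-fine-tuned model
merged_sd = interpolate_state_dicts(base.state_dict(), tuned.state_dict(), alpha=0.5)
base.load_state_dict(merged_sd)  # base now holds the interpolated weights
```

In practice the interpolation coefficient trades off video specialization (alpha near 1) against the generalization of the original CLIP weights (alpha near 0); the paper applies interpolation during both training and testing.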
Related papers
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)