EZ-CLIP: Efficient Zeroshot Video Action Recognition
- URL: http://arxiv.org/abs/2312.08010v2
- Date: Fri, 19 Jan 2024 12:19:48 GMT
- Title: EZ-CLIP: Efficient Zeroshot Video Action Recognition
- Authors: Shahzad Ahmad, Sukalpa Chanda, Yogesh S Rawat
- Abstract summary: We present EZ-CLIP, a simple and efficient adaptation of CLIP.
We introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion.
EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations.
- Score: 13.403597169664803
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements in large-scale pre-training of visual-language models on
paired image-text data have demonstrated impressive generalization capabilities
for zero-shot tasks. Building on this success, efforts have been made to adapt
these image-based visual-language models, such as CLIP, for videos, extending
their zero-shot capabilities to the video domain. While these adaptations have
shown promising results, they come at a significant computational cost and
struggle with effectively modeling the crucial temporal aspects inherent to the
video domain. In this study, we present EZ-CLIP, a simple and efficient
adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal
visual prompting for seamless temporal adaptation, requiring no fundamental
alterations to the core CLIP architecture while preserving its remarkable
generalization abilities. Moreover, we introduce a novel learning objective
that guides the temporal visual prompts to focus on capturing motion, thereby
enhancing its learning capabilities from video data. We conducted extensive
experiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP
for zero-shot learning and base-to-novel video action recognition, and also
demonstrating its potential for few-shot generalization. Impressively, with a
mere 5.2 million learnable parameters (as opposed to the 71.1 million in the
prior best model), EZ-CLIP can be efficiently trained on a single GPU,
outperforming existing approaches in several evaluations.
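To make the abstract's description more concrete, below is a minimal PyTorch sketch of the general idea: per-frame features from a frozen CLIP visual encoder are adapted with a small set of learnable temporal prompt parameters, and an auxiliary motion-focused term discourages the adapted features from ignoring frame-to-frame change. The names (TemporalVisualPrompts, motion_focused_loss), the single temporal attention layer, and the exact loss form are illustrative assumptions, not the authors' released implementation.
```python
# Minimal sketch (assumed structure, not the EZ-CLIP code): a frozen CLIP
# backbone adapted to video with a small learnable temporal-prompt module,
# trained with a CLIP-style classification loss plus a motion-focused term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalVisualPrompts(nn.Module):
    """Learnable prompt vectors plus a light temporal mixer over frozen CLIP frame features."""

    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        # The only new learnable state: a few prompt vectors and one temporal attention layer.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame features from a frozen CLIP visual encoder.
        mixed, _ = self.temporal_attn(frame_feats, frame_feats, frame_feats)
        # Residual connection plus a broadcast prompt summary; CLIP itself stays untouched.
        return frame_feats + mixed + self.prompts.mean(dim=0)


def motion_focused_loss(feats: torch.Tensor) -> torch.Tensor:
    # Stand-in motion objective: penalize adapted features whose frame-to-frame
    # variation vanishes, i.e. features that ignore motion entirely.
    diff = feats[:, 1:] - feats[:, :-1]              # (B, T-1, D) temporal differences
    return F.relu(1.0 - diff.norm(dim=-1)).mean()


def video_logits(feats: torch.Tensor, text_feats: torch.Tensor, logit_scale: float) -> torch.Tensor:
    # CLIP-style classification: mean-pooled video embedding against class text embeddings.
    video_emb = F.normalize(feats.mean(dim=1), dim=-1)   # (B, D)
    text_emb = F.normalize(text_feats, dim=-1)           # (C, D)
    return logit_scale * video_emb @ text_emb.t()


if __name__ == "__main__":
    B, T, D, C = 2, 8, 512, 5
    frame_feats = torch.randn(B, T, D)   # stand-in for frozen CLIP frame features
    text_feats = torch.randn(C, D)       # stand-in for frozen CLIP text features

    adapter = TemporalVisualPrompts(num_prompts=4, dim=D)
    feats = adapter(frame_feats)
    logits = video_logits(feats, text_feats, logit_scale=100.0)

    labels = torch.randint(0, C, (B,))
    loss = F.cross_entropy(logits, labels) + 0.5 * motion_focused_loss(feats)
    loss.backward()                      # only the small adapter receives gradients
    print(logits.shape, float(loss))
```
The design point the abstract emphasizes is that only the small adapter is trained while CLIP stays frozen, which is what keeps the learnable parameter count low and allows single-GPU training.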
Related papers
- Is Temporal Prompting All We Need For Limited Labeled Action Recognition? [11.47868206641396]
We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture.
TP-CLIP efficiently integrates into the CLIP architecture, leveraging its pre-trained capabilities for video data.
arXiv Detail & Related papers (2025-04-02T16:50:28Z)
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input.
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z)
- Learning Visual Composition through Improved Semantic Guidance [19.24813992815684]
We show that by substantially improving weakly labeled data, we can vastly improve the performance of standard contrastive learning approaches.
We showcase our results on a relatively new captioning benchmark derived from DOCCI.
We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
arXiv Detail & Related papers (2024-12-19T20:58:26Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities.
We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A CLIP-Hitchhiker's Guide to Long Video Retrieval [84.36155238161462]
We study the adaptation of image-text models for long video retrieval.
Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP.
We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement (a brief sketch of this pooling appears after this list).
arXiv Detail & Related papers (2022-05-17T17:26:23Z)
- Motion-Focused Contrastive Learning of Video Representations [94.93666741396444]
Motion, as the most distinct phenomenon in a video involving changes over time, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that regards such duet as the foundation.
arXiv Detail & Related papers (2022-01-11T16:15:45Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate their effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
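For the query-scoring baseline mentioned in the CLIP-Hitchhiker entry above, a minimal sketch of a query-scored weighted mean of frame embeddings is shown below; the softmax temperature and function name are assumptions for illustration, not the paper's exact recipe.
```python
# Sketch (assumed formulation): pool CLIP frame embeddings with weights given by
# their similarity to the text query, so query-relevant frames dominate the video embedding.
import torch
import torch.nn.functional as F


def query_scored_video_embedding(frame_embs: torch.Tensor,
                                 query_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """frame_embs: (T, D) CLIP frame embeddings; query_emb: (D,) CLIP text embedding."""
    frame_embs = F.normalize(frame_embs, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)
    sims = frame_embs @ query_emb                       # (T,) query-frame similarities
    weights = torch.softmax(sims / temperature, dim=0)  # higher weight for query-relevant frames
    video_emb = (weights.unsqueeze(-1) * frame_embs).sum(dim=0)
    return F.normalize(video_emb, dim=-1)               # (D,) pooled video embedding


if __name__ == "__main__":
    frames = torch.randn(16, 512)   # stand-in CLIP embeddings for 16 frames
    query = torch.randn(512)        # stand-in CLIP text-query embedding
    print(query_scored_video_embedding(frames, query).shape)  # torch.Size([512])
```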
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.