Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
- URL: http://arxiv.org/abs/2304.03307v1
- Date: Thu, 6 Apr 2023 18:00:04 GMT
- Title: Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
- Authors: Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan,
Mubarak Shah
- Abstract summary: We propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
- Score: 111.49781716597984
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Adopting contrastive image-text pretrained models like CLIP towards video
classification has gained attention due to its cost-effectiveness and
competitive performance. However, recent works in this area face a trade-off.
Finetuning the pretrained model to achieve strong supervised performance
results in low zero-shot generalization. Similarly, freezing the backbone to
retain zero-shot capability causes significant drop in supervised accuracy.
Because of this, recent works in the literature typically train separate models for
supervised and zero-shot action recognition. In this work, we propose a
multimodal prompt learning scheme that works to balance the supervised and
zero-shot performance under a single unified training. Our prompting approach
on the vision side caters for three aspects: 1) Global video-level prompts to
model the data distribution; 2) Local frame-level prompts to provide per-frame
discriminative conditioning; and 3) a summary prompt to extract a condensed
video representation. Additionally, we define a prompting scheme on the text
side to augment the textual context. Through this prompting scheme, we can
achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and
UCF101 while remaining competitive in the supervised setting. By keeping the
pretrained backbone frozen, we optimize a much lower number of parameters and
retain the existing general representation which helps achieve the strong
zero-shot performance. Our codes/models are released at
https://github.com/TalalWasim/Vita-CLIP.
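As a rough illustration of the prompting scheme described above (not the released implementation), the sketch below prepends learnable summary, global video-level, and per-frame local prompt tokens to a frozen backbone and trains only the prompts. The generic transformer and all dimensions are stand-in assumptions; the text-side context prompting would be handled analogously on the text encoder.
```python
import torch
import torch.nn as nn

class PromptedVideoEncoder(nn.Module):
    def __init__(self, dim=512, num_frames=8, num_global_prompts=8, num_local_prompts=8):
        super().__init__()
        # stand-in for the frozen pretrained backbone (the real model builds on CLIP's ViT)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False          # backbone stays frozen; only prompts are trained
        # learnable prompts: global video-level, local per-frame, and one summary token
        self.global_prompts = nn.Parameter(0.02 * torch.randn(num_global_prompts, dim))
        self.local_prompts = nn.Parameter(0.02 * torch.randn(num_frames, num_local_prompts, dim))
        self.summary_prompt = nn.Parameter(0.02 * torch.randn(1, dim))

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, D) patch tokens from the frozen image encoder
        B, T, N, D = frame_tokens.shape
        local = self.local_prompts.unsqueeze(0).expand(B, -1, -1, -1)      # (B, T, P, D)
        per_frame = torch.cat([local, frame_tokens], dim=2).reshape(B, -1, D)
        glob = self.global_prompts.unsqueeze(0).expand(B, -1, -1)          # (B, G, D)
        summ = self.summary_prompt.unsqueeze(0).expand(B, -1, -1)          # (B, 1, D)
        out = self.backbone(torch.cat([summ, glob, per_frame], dim=1))
        return out[:, 0]   # summary token acts as the condensed video representation

video_feat = PromptedVideoEncoder()(torch.randn(2, 8, 50, 512))
print(video_feat.shape)  # torch.Size([2, 512])
```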
Related papers
- Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z)
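A minimal sketch of the self-regularization idea summarized above, under the assumption that it can be expressed as a task loss plus a term that keeps prompted features close to the frozen pretrained features; the names and weighting are hypothetical, not PromptSRC's actual objective.
```python
import torch
import torch.nn.functional as F

def self_regulated_loss(logits, labels, prompted_feat, frozen_feat, lam=1.0):
    """Task loss plus a retention term that keeps prompted features near the frozen ones."""
    task_loss = F.cross_entropy(logits, labels)
    retention = 1.0 - F.cosine_similarity(prompted_feat, frozen_feat, dim=-1).mean()
    return task_loss + lam * retention

# toy usage with random tensors
loss = self_regulated_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                           torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```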
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
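The interpolated-weight idea named in the title can be illustrated with a short sketch that blends the original and fine-tuned parameters; this is an assumption-level illustration, not the authors' optimization procedure.
```python
import torch
import torch.nn as nn

def interpolate_weights(original: nn.Module, finetuned: nn.Module, alpha: float) -> dict:
    """Return a state dict that is (1 - alpha) * original + alpha * fine-tuned."""
    base, tuned = original.state_dict(), finetuned.state_dict()
    return {k: (1 - alpha) * base[k].float() + alpha * tuned[k].float() for k in base}

# toy usage: blend two copies of a small model halfway
frozen_clip, video_tuned = nn.Linear(512, 512), nn.Linear(512, 512)
frozen_clip.load_state_dict(interpolate_weights(frozen_clip, video_tuned, alpha=0.5))
```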
- Understanding Zero-Shot Adversarial Robustness for Large-Scale Models [31.295249927085475]
We identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness.
We propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning.
Our approach significantly improves the zero-shot adversarial robustness over CLIP, with an average improvement of over 31 points across ImageNet and 15 zero-shot datasets.
arXiv Detail & Related papers (2022-12-14T04:08:56Z)
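A hedged sketch of a text-guided contrastive adversarial training step as summarized above: craft an adversarial image that breaks image-text alignment, then update the image encoder so its adversarial features realign with the text embeddings. The encoders and the single-step attack below are toy stand-ins, not the paper's setup.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(3 * 32 * 32, 128)                  # stand-in for CLIP's image encoder
text_embeddings = F.normalize(torch.randn(10, 128), dim=-1)  # fixed class-text embeddings
optimizer = torch.optim.SGD(image_encoder.parameters(), lr=1e-3)

def contrastive_loss(image_feat, labels, temperature=0.07):
    logits = F.normalize(image_feat, dim=-1) @ text_embeddings.t() / temperature
    return F.cross_entropy(logits, labels)

def training_step(images, labels, eps=2 / 255):
    # 1) craft an adversarial perturbation that breaks image-text alignment (one FGSM step)
    delta = torch.zeros_like(images, requires_grad=True)
    contrastive_loss(image_encoder((images + delta).flatten(1)), labels).backward()
    adv_images = (images + eps * delta.grad.sign()).detach()
    # 2) update the encoder so adversarial visual features realign with the text embeddings
    optimizer.zero_grad()
    loss = contrastive_loss(image_encoder(adv_images.flatten(1)), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

print(training_step(torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))))
```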
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
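The frame-level fine-tuning baseline described above can be sketched as encoding each frame, average-pooling over time, and comparing the result against class-text embeddings; the tiny encoder below is a stand-in, not CLIP.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLevelVideoCLIP(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))

    def forward(self, video):                           # video: (B, T, 3, 32, 32)
        B, T = video.shape[:2]
        frame_feats = self.image_encoder(video.reshape(B * T, *video.shape[2:]))
        frame_feats = frame_feats.reshape(B, T, -1)
        return F.normalize(frame_feats.mean(dim=1), dim=-1)   # temporal average pooling

model = FrameLevelVideoCLIP()
video_feat = model(torch.rand(2, 8, 3, 32, 32))
text_feat = F.normalize(torch.randn(10, 128), dim=-1)   # class prompt embeddings
logits = 100.0 * video_feat @ text_feat.t()             # zero-shot-style logits
print(logits.shape)  # torch.Size([2, 10])
```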
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
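A sketch of what a parameter-free cross-attention between spatial visual features and class-text features could look like; this illustrates the idea named in the title rather than CALIP's exact update rule, and all shapes are assumed.
```python
import torch
import torch.nn.functional as F

def parameter_free_attention(patch_feats, text_feats, temperature=0.07):
    # patch_feats: (B, N, D) spatial visual features; text_feats: (K, D) class embeddings
    v = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    attn = (v @ t.t()) / temperature                              # (B, N, K), no learned weights
    text_aware_visual = attn.softmax(dim=-1) @ t                  # visual features refreshed by text
    visual_aware_text = attn.softmax(dim=1).transpose(1, 2) @ v   # text features refreshed by image
    return text_aware_visual, visual_aware_text

vis, txt = parameter_free_attention(torch.randn(2, 49, 512), torch.randn(10, 512))
print(vis.shape, txt.shape)  # torch.Size([2, 49, 512]) torch.Size([2, 10, 512])
```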
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
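The pseudo-label plus masked-image self-training idea summarized above might look roughly like the following, with toy stand-in models and a crude random mask; the actual MUST recipe has further components.
```python
import torch
import torch.nn.functional as F

student = torch.nn.Linear(3 * 32 * 32, 10)
teacher = torch.nn.Linear(3 * 32 * 32, 10)     # e.g. an EMA copy of the student
teacher.load_state_dict(student.state_dict())

def must_style_loss(images, mask_ratio=0.5, conf_thresh=0.7):
    flat = images.flatten(1)
    with torch.no_grad():                                   # pseudo-labels from clean images
        conf, pseudo = teacher(flat).softmax(dim=-1).max(dim=-1)
    masked = flat * (torch.rand_like(flat) > mask_ratio)    # crude random masking of pixels
    logits = student(masked)
    keep = conf > conf_thresh                               # keep only confident pseudo-labels
    if keep.any():
        return F.cross_entropy(logits[keep], pseudo[keep])
    return logits.sum() * 0.0                               # nothing confident in this batch

print(must_style_loss(torch.rand(8, 3, 32, 32), conf_thresh=0.0))
```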
- Action Localization through Continual Predictive Learning [14.582013761620738]
We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for future frames.
This self-supervised framework is less complicated than other approaches but is very effective in learning robust visual representations for both labeling and localization.
arXiv Detail & Related papers (2020-03-26T23:32:43Z)
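A simplified sketch of feature-level predictive self-supervision as described in this entry: a small CNN encodes frames, an LSTM predicts the next frame's features, and the prediction error provides both the training loss and a per-frame event signal. The attention mechanisms mentioned above are omitted and all sizes are illustrative.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveModel(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # small CNN frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.lstm = nn.LSTM(feat_dim, feat_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(feat_dim, feat_dim)          # predicts the next frame's feature

    def forward(self, frames):                             # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.encoder(frames.reshape(B * T, *frames.shape[2:])).reshape(B, T, -1)
        pred, _ = self.lstm(feats[:, :-1])                 # predict from past frames
        pred = self.head(pred)
        # per-step prediction error: large errors can indicate event boundaries
        error = F.mse_loss(pred, feats[:, 1:], reduction="none").mean(dim=-1)
        return error.mean(), error                         # (training loss, per-frame error)

loss, per_frame_error = PredictiveModel()(torch.rand(2, 6, 3, 32, 32))
print(loss.item(), per_frame_error.shape)  # scalar, torch.Size([2, 5])
```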