Fine-tuned CLIP Models are Efficient Video Learners
- URL: http://arxiv.org/abs/2212.03640v3
- Date: Sun, 26 Mar 2023 11:40:16 GMT
- Title: Fine-tuned CLIP Models are Efficient Video Learners
- Authors: Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan,
Fahad Shahbaz Khan
- Abstract summary: Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
- Score: 54.96069171726668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale multi-modal training with image-text pairs imparts strong
generalization to the CLIP model. Since training at a similar scale on videos is
infeasible, recent approaches focus on effectively transferring image-based
CLIP to the video domain. In this pursuit, new parametric modules are added to
learn temporal information and inter-frame relationships, which requires
meticulous design effort. Furthermore, when the resulting models are trained
on videos, they tend to overfit to the given task distribution and lack
generalization. This raises the question: how can we effectively
transfer image-level CLIP representations to videos? In this work, we show that
a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to
bridge the domain gap from images to videos. Our qualitative analysis
illustrates that frame-level processing by the CLIP image encoder, followed by
feature pooling and similarity matching with the corresponding text embeddings,
helps implicitly model temporal cues within ViFi-CLIP. Such
fine-tuning helps the model to focus on scene dynamics, moving objects and
inter-object relationships. For low-data regimes where full fine-tuning is not
viable, we propose a `bridge and prompt' approach that first uses fine-tuning
to bridge the domain gap and then learns prompts on the language and vision sides to
adapt CLIP representations. We extensively evaluate this simple yet strong
baseline on zero-shot, base-to-novel generalization, few-shot and fully
supervised settings across five video benchmarks. Our code is available at
https://github.com/muzairkhattak/ViFi-CLIP.
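
The recipe described in the abstract is compact enough to sketch directly. The snippet below is a minimal, hypothetical PyTorch rendering of the ViFi-CLIP forward pass: frame-level encoding with the CLIP image encoder, pooling over frames, and similarity matching against text embeddings. The `encode_image`/`encode_text` interface follows the public OpenAI CLIP implementation; the choice of mean pooling and the logit scale are illustrative assumptions, not the paper's exact settings.

```python
import torch.nn.functional as F

def vifi_clip_logits(clip_model, video, text_tokens, logit_scale=100.0):
    """Minimal sketch of the ViFi-CLIP forward pass described above.

    video:       (B, T, C, H, W) tensor of T sampled frames per clip.
    text_tokens: (K, L) tokenized class prompts.
    Assumes a CLIP-like model exposing encode_image / encode_text,
    as in the public OpenAI CLIP implementation.
    """
    B, T = video.shape[:2]

    # Frame-level processing with the CLIP image encoder.
    frame_feats = clip_model.encode_image(video.flatten(0, 1))  # (B*T, D)
    frame_feats = frame_feats.view(B, T, -1)

    # Feature pooling: average the frame embeddings into one clip embedding.
    video_feats = F.normalize(frame_feats.mean(dim=1), dim=-1)  # (B, D)

    # Similarity matching with the corresponding text embeddings.
    text_feats = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    return logit_scale * video_feats @ text_feats.t()           # (B, K)
```

Under full fine-tuning, both encoders would be updated with a standard classification loss over these logits; in the `bridge and prompt' regime described above, the bridged backbone would instead be frozen and only learnable prompt vectors on the vision and language sides would be trained.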
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
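
The practical payoff of this text-agnostic design is that video embeddings can be indexed once offline and then reused for every incoming text query. The sketch below illustrates only that usage pattern; the `encode_video` and `encode_text` callables are hypothetical stand-ins, not the paper's actual Prompt Switch modules.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_video_index(encode_video, videos):
    # Offline stage: embed every video once and cache the results.
    feats = torch.stack([encode_video(v) for v in videos])  # (N, D)
    return F.normalize(feats, dim=-1)

@torch.no_grad()
def retrieve(encode_text, video_index, query_tokens, topk=5):
    # Online stage: each new text query reuses the cached video embeddings,
    # so no video is re-encoded at query time.
    q = F.normalize(encode_text(query_tokens), dim=-1)      # (D,)
    scores = video_index @ q                                 # (N,)
    return scores.topk(min(topk, scores.numel()))
```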
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism built on top of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, an efficient framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
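
The efficiency claim rests on keeping the CLIP backbone frozen and training only a small video-specific head on top of the cached frame features. The module below is a hedged illustration of that recipe; the particular head (one transformer encoder layer plus a linear classifier) is our stand-in, not necessarily the decoder used in the paper.

```python
import torch
import torch.nn as nn

class FrozenCLIPVideoHead(nn.Module):
    """Illustrative frozen-backbone recipe: only this head is trained."""

    def __init__(self, clip_model, embed_dim=512, num_classes=400):
        super().__init__()
        self.clip = clip_model
        for p in self.clip.parameters():           # keep CLIP frozen
            p.requires_grad_(False)
        # Lightweight temporal module; the paper's exact head may differ.
        self.temporal = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, video):                      # (B, T, C, H, W)
        B, T = video.shape[:2]
        with torch.no_grad():                      # frozen feature extraction
            feats = self.clip.encode_image(video.flatten(0, 1))
        feats = self.temporal(feats.view(B, T, -1).float())  # temporal mixing
        return self.classifier(feats.mean(dim=1))  # (B, num_classes)
```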
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
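
The summary above does not spell out the SCL formulation, so the snippet below is only a generic frame-wise InfoNCE between two correlated (augmented) views, given as a hedged stand-in to make the setup concrete; the paper's actual sequence contrastive loss may weight temporal neighbours differently.

```python
import torch
import torch.nn.functional as F

def frame_infonce(feats_a, feats_b, temperature=0.1):
    """Generic frame-wise contrastive loss between two correlated views.

    feats_a, feats_b: (T, D) per-frame embeddings of two augmentations of
    the same video; frame t in one view is treated as the positive for
    frame t in the other. A stand-in, not the paper's exact SCL.
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                  # (T, T) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: each frame must pick its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```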
This list is automatically generated from the titles and abstracts of the papers in this site.