Frozen CLIP Models are Efficient Video Learners
- URL: http://arxiv.org/abs/2208.03550v1
- Date: Sat, 6 Aug 2022 17:38:25 GMT
- Title: Frozen CLIP Models are Efficient Video Learners
- Authors: Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo,
Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
- Abstract summary: Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning (EVL) -- a framework for directly training high-quality video recognition models on top of frozen CLIP features.
- Score: 86.73871814176795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video recognition has been dominated by the end-to-end learning paradigm --
first initializing a video recognition model with weights of a pretrained image
model and then conducting end-to-end training on videos. This enables the video
network to benefit from the pretrained image model. However, this requires
substantial computation and memory resources for finetuning on videos and the
alternative of directly using pretrained image features without finetuning the
image backbone leads to subpar results. Fortunately, recent advances in
Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route
for visual recognition tasks. Pretrained on large open-vocabulary image-text
pair data, these models learn powerful visual representations with rich
semantics. In this paper, we present Efficient Video Learning (EVL) -- an
efficient framework for directly training high-quality video recognition models
with frozen CLIP features. Specifically, we employ a lightweight Transformer
decoder and learn a query token to dynamically collect frame-level spatial
features from the CLIP image encoder. Furthermore, we adopt a local temporal
module in each decoder layer to discover temporal clues from adjacent frames
and their attention maps. We show that despite being efficient to train with a
frozen backbone, our models learn high quality video representations on a
variety of video recognition datasets. Code is available at
https://github.com/OpenGVLab/efficient-video-recognition.
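For concreteness, below is a minimal PyTorch sketch of this kind of frozen-backbone design: a learned query token cross-attends to per-frame CLIP patch features in a small Transformer decoder, with a cheap temporal-mixing step over adjacent frames. The module names, tensor shapes, and the depthwise temporal convolution used as the "local temporal module" are illustrative assumptions, not the authors' implementation (which is in the repository linked above).

```python
# Minimal sketch of an EVL-style head over frozen CLIP features.
# Shapes, layer names, and the depthwise temporal convolution used as the
# "local temporal module" are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class EVLDecoderLayer(nn.Module):
    """Cross-attention from a video query token to frame-level CLIP features,
    preceded by a cheap temporal mixing step over adjacent frames."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Local temporal module (assumed form): depthwise 1-D conv over time.
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, query: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # query: [B, 1, C]; frame_feats: [B, T, N, C] frozen CLIP patch features.
        pooled = frame_feats.mean(dim=2)                                   # [B, T, C]
        pooled = pooled + self.temporal_conv(pooled.transpose(1, 2)).transpose(1, 2)
        kv = torch.cat([frame_feats.flatten(1, 2), pooled], dim=1)         # [B, T*N+T, C]
        attn_out, _ = self.cross_attn(self.norm_q(query),
                                      self.norm_kv(kv), self.norm_kv(kv))
        query = query + attn_out
        return query + self.mlp(query)


class EVLHead(nn.Module):
    """Lightweight decoder trained on top of a frozen CLIP image encoder."""

    def __init__(self, dim: int = 768, depth: int = 4, num_classes: int = 400):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.layers = nn.ModuleList([EVLDecoderLayer(dim) for _ in range(depth)])
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [B, T, N, C], computed once by the frozen CLIP backbone.
        q = self.query.expand(frame_feats.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, frame_feats)
        return self.fc(q.squeeze(1))                                       # [B, num_classes]


# Example: 8 frames of ViT-B/16 patch tokens (197 tokens x 768 dims per frame).
logits = EVLHead()(torch.randn(2, 8, 197, 768))                            # [2, 400]
```

Because the CLIP backbone stays frozen, the per-frame features can in principle be extracted once and cached, so only the small decoder needs to be optimized -- which is where the training-efficiency claim comes from.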
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner (a rough sketch of this idea follows this entry).
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
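As referenced in the entry above, here is a hedged illustration of what parameter-free, text-conditioned temporal saliency can look like when per-frame CLIP embeddings and a class-text embedding are already available; the function name, temperature, and pooling scheme are assumptions for this sketch, not BIKE's exact Temporal Concept Spotting mechanism.

```python
# Hedged illustration of parameter-free, text-conditioned temporal saliency:
# frames more similar to the class-text embedding receive larger pooling weights.
# Function name, temperature, and pooling scheme are assumptions for this sketch.
import torch
import torch.nn.functional as F


def saliency_pooled_video_feature(frame_emb: torch.Tensor,
                                  text_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """frame_emb: [T, C] per-frame CLIP embeddings; text_emb: [C] class-text embedding."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    saliency = frame_emb @ text_emb                          # [T] cosine similarity per frame
    weights = torch.softmax(saliency / temperature, dim=0)
    return (weights.unsqueeze(-1) * frame_emb).sum(dim=0)    # [C] saliency-weighted feature
```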
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- iBoot: Image-bootstrapped Self-Supervised Video Representation Learning [45.845595749486215]
Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets.
We propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework.
The proposed algorithm is shown to learn much more efficiently in fewer epochs and with smaller batches.
arXiv Detail & Related papers (2022-06-16T17:42:48Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework that uses an unsupervised tracking signal to select clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on the VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers (a minimal tokenization sketch follows this entry).
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
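The tokenization sketch referenced in the ViViT entry above: a 3-D convolution with stride equal to its kernel size cuts a clip into non-overlapping spatio-temporal tubelets, which a standard Transformer encoder then processes. Patch size, width, depth, and head count here are illustrative assumptions, not ViViT's actual configuration.

```python
# Sketch of spatio-temporal (tubelet-style) token extraction for a video
# transformer; sizes and encoder depth are illustrative, not ViViT's configuration.
import torch
import torch.nn as nn


class VideoTokenizer(nn.Module):
    def __init__(self, dim: int = 768, tubelet=(2, 16, 16), in_ch: int = 3):
        super().__init__()
        # A 3-D conv with stride == kernel size cuts the clip into
        # non-overlapping spatio-temporal tubelets and projects each to a token.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: [B, C, T, H, W] -> tokens: [B, num_tokens, dim]
        return self.proj(video).flatten(2).transpose(1, 2)


tokenizer = VideoTokenizer()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=4)

clip_tensor = torch.randn(1, 3, 8, 224, 224)   # one 8-frame 224x224 clip
tokens = tokenizer(clip_tensor)                # [1, 4*14*14, 768] spatio-temporal tokens
encoded = encoder(tokens)                      # same shape, contextualized by self-attention
```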
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text (a generic sketch of such a pair-discrimination objective follows this entry).
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
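As noted in the CPD entry, a pair-discrimination objective can be sketched as a symmetric contrastive loss over matched video-text pairs in a batch. The sketch below is generic (the function name and temperature are assumptions) and does not reproduce CPD's exact loss or encoders.

```python
# Generic sketch of a video-text pair-discrimination objective: a symmetric
# contrastive loss over matched pairs in a batch. It illustrates the idea
# behind CPD rather than reproducing its exact loss or encoders.
import torch
import torch.nn.functional as F


def pair_discrimination_loss(video_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: [B, C]; matched pairs share the same row index."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                     # [B, B] similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Each video should identify its own text, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```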
This list is automatically generated from the titles and abstracts of the papers on this site.