AIM: Adapting Image Models for Efficient Video Action Recognition
- URL: http://arxiv.org/abs/2302.03024v1
- Date: Mon, 6 Feb 2023 18:59:17 GMT
- Title: AIM: Adapting Image Models for Efficient Video Action Recognition
- Authors: Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, Mu Li
- Abstract summary: We propose a method to Adapt pre-trained Image Models (AIM) for efficient video understanding.
By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation.
We show that our proposed AIM can achieve competitive or even better performance than prior arts with substantially fewer tunable parameters.
- Score: 22.805026175928997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision transformer based video models mostly follow the
"image pre-training then finetuning" paradigm and have achieved great success on
multiple video benchmarks. However, fully finetuning such a video model could be
computationally expensive and unnecessary, given that pre-trained image
transformer models have demonstrated exceptional transferability. In this work,
we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient
video understanding. By freezing the pre-trained image model and adding a few
lightweight Adapters, we introduce spatial adaptation, temporal adaptation and
joint adaptation to gradually equip an image model with spatiotemporal
reasoning capability. We show that our proposed AIM can achieve competitive or
even better performance than prior arts with substantially fewer tunable
parameters on four video action recognition benchmarks. Thanks to its
simplicity, our method is also generally applicable to different image
pre-trained models, which has the potential to leverage more powerful image
foundation models in the future. The project webpage is
https://adapt-image-models.github.io/.
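For intuition, below is a minimal PyTorch sketch of the adapter idea described in the abstract: the pre-trained image transformer block stays frozen, and only small bottleneck adapters are tuned for spatial, temporal and joint adaptation. This is not the authors' implementation; the module names, bottleneck ratio, and exact adapter placement are illustrative assumptions, and the sketch assumes a timm-style block exposing norm1/attn/norm2/mlp.

```python
# Minimal sketch of the adapter idea from the abstract (not the authors' code).
# Assumes a timm-style ViT block with norm1/attn/norm2/mlp submodules; bottleneck
# ratio and adapter placement are illustrative assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, GELU, up-project, skip."""

    def __init__(self, dim, ratio=0.25):
        super().__init__()
        hidden = int(dim * ratio)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Frozen image-transformer block wrapped with spatial/temporal/joint adapters."""

    def __init__(self, frozen_block, dim, num_frames):
        super().__init__()
        self.block = frozen_block
        self.num_frames = num_frames
        self.temporal_adapter = Adapter(dim)  # temporal adaptation
        self.spatial_adapter = Adapter(dim)   # spatial adaptation
        self.joint_adapter = Adapter(dim)     # joint adaptation
        for p in self.block.parameters():     # only the adapters remain tunable
            p.requires_grad = False

    def forward(self, x):
        # x: (batch * num_frames, num_tokens, dim), frames flattened into the batch
        bt, n, d = x.shape
        b = bt // self.num_frames

        # Temporal adaptation: reuse the frozen attention along the time axis.
        xt = x.reshape(b, self.num_frames, n, d).permute(0, 2, 1, 3).reshape(b * n, self.num_frames, d)
        xt = self.temporal_adapter(self.block.attn(self.block.norm1(xt)))
        x = x + xt.reshape(b, n, self.num_frames, d).permute(0, 2, 1, 3).reshape(bt, n, d)

        # Spatial adaptation: frozen spatial attention followed by an adapter.
        x = x + self.spatial_adapter(self.block.attn(self.block.norm1(x)))

        # Joint adaptation: adapter applied alongside the frozen MLP.
        x = x + self.block.mlp(self.block.norm2(x)) + self.joint_adapter(self.block.norm2(x))
        return x
```

With the backbone frozen this way, only the adapter weights show up in the trainable-parameter count (e.g. sum(p.numel() for p in model.parameters() if p.requires_grad)), which is the source of the "substantially fewer tunable parameters" claim.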
Related papers
- AVID: Adapting Video Diffusion Models to World Models [10.757223474031248]
We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model.
AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos.
We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
arXiv Detail & Related papers (2024-10-01T13:48:31Z)
- FE-Adapter: Adapting Image-based Emotion Classifiers to Videos [21.294212686294568]
We present the Facial-Emotion Adapter (FE-Adapter), designed for efficient fine-tuning in video tasks.
FE-Adapter can match or even surpass existing fine-tuning and video emotion models in both performance and efficiency.
arXiv Detail & Related papers (2024-08-05T12:27:28Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video [15.952896909797728]
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks.
Recent research is shifting its focus toward parameter-efficient image-to-video adaptation.
We present a new adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks.
arXiv Detail & Related papers (2023-10-02T16:41:20Z)
- Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning (EVL), an efficient framework for directly training high-quality video recognition models with frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Parameter-Efficient Image-to-Video Transfer Learning [66.82811235484607]
Large pre-trained models for various downstream tasks of interest have recently emerged with promising performance.
Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes costly in terms of model training and storage.
We propose a new spatio-temporal adapter (ST-Adapter) for parameter-efficient fine-tuning per video task.
arXiv Detail & Related papers (2022-06-27T18:02:29Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and tails.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)