Parameter-Efficient Image-to-Video Transfer Learning
- URL: http://arxiv.org/abs/2206.13559v1
- Date: Mon, 27 Jun 2022 18:02:29 GMT
- Title: Parameter-Efficient Image-to-Video Transfer Learning
- Authors: Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li
- Abstract summary: Large pre-trained models have recently emerged with promising performance on various downstream tasks of interest.
Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes costly in terms of model training and storage.
We propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task.
- Score: 66.82811235484607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Capitalizing on large pre-trained models for various downstream tasks of
interest has recently emerged as a strategy with promising performance. Due to the
ever-growing model size, the standard full fine-tuning based task adaptation
strategy becomes prohibitively costly in terms of model training and storage.
This has led to a new research direction in parameter-efficient transfer
learning. However, existing attempts typically focus on downstream tasks from
the same modality (e.g., image understanding) as the pre-trained model. This
is limiting, because for some modalities (e.g., video understanding) such a
strong pre-trained model with sufficient knowledge is scarce or unavailable. In
this work, we investigate a novel cross-modality
transfer learning setting, namely parameter-efficient image-to-video transfer
learning. To solve this problem, we propose a new Spatio-Temporal Adapter
(ST-Adapter) for parameter-efficient fine-tuning per video task. With a
built-in spatio-temporal reasoning capability in a compact design, ST-Adapter
enables a pre-trained image model without temporal knowledge to reason about
dynamic video content at a small (~8%) per-task parameter cost, requiring
approximately 20 times fewer updated parameters compared to previous work.
Extensive experiments on video action recognition tasks show that our
ST-Adapter can match or even outperform the strong full fine-tuning strategy
and state-of-the-art video models, whilst enjoying the advantage of parameter
efficiency.
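The abstract does not spell out the adapter's internals. As a rough sketch, compact spatio-temporal reasoning on top of a frozen image model can be realized as a bottleneck around a depthwise 3D convolution; the layer sizes, kernel shape, and placement inside the backbone below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class STAdapterSketch(nn.Module):
    """Illustrative spatio-temporal adapter: bottleneck + depthwise 3D conv.

    All sizes here are assumptions for illustration, not the paper's
    exact configuration.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 96,
                 kernel: tuple = (3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)        # project tokens down
        self.conv = nn.Conv3d(
            bottleneck, bottleneck, kernel_size=kernel,
            padding=tuple(k // 2 for k in kernel),
            groups=bottleneck,                        # depthwise: very few params
        )
        self.up = nn.Linear(bottleneck, dim)          # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (batch, t*h*w, dim) patch tokens from a frozen image transformer
        b, n, _ = x.shape
        z = self.down(x).view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)
        z = self.conv(z)                              # mixes along time (and space)
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)
        return x + self.up(self.act(z))               # residual keeps frozen path intact
```

Per task, only such adapters (plus a classification head) would be trained while the image backbone stays frozen, which is where the small per-task parameter cost described in the abstract comes from.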
Related papers
- SurgPETL: Parameter-Efficient Image-to-Surgical-Video Transfer Learning for Surgical Phase Recognition [9.675072799670458]
"Image pre-training followed by video fine-tuning" for high-dimensional video data poses significant performance bottlenecks.
In this paper, we develop a parameter-efficient transfer learning benchmark SurgPETL for surgical phase recognition.
We conduct extensive experiments with three advanced methods based on ViTs of two distinct scales pre-trained on five large-scale natural and medical datasets.
arXiv Detail & Related papers (2024-09-30T08:33:50Z) - TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning [6.329214318116305]
We propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling.
Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features.
We also design a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos.
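As a rough illustration of the temporal-difference idea summarized above (not TDS-CLIP's actual TD-Adapter, which is a learned module), consecutive-frame feature differences give a cheap motion cue:

```python
import torch

def temporal_difference(feats: torch.Tensor) -> torch.Tensor:
    """Consecutive-frame feature differences as a cheap motion cue.

    feats: (batch, time, dim) per-frame features from a frozen image encoder.
    This only sketches the raw temporal-difference signal; TD-Adapter itself
    is a learned module.
    """
    diff = feats[:, 1:] - feats[:, :-1]               # (batch, time-1, dim)
    # Zero-pad the first step so the time length is preserved.
    return torch.cat([torch.zeros_like(feats[:, :1]), diff], dim=1)
```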
arXiv Detail & Related papers (2024-08-20T09:40:08Z) - FE-Adapter: Adapting Image-based Emotion Classifiers to Videos [21.294212686294568]
We present the Facial-Emotion Adapter (FE-Adapter), designed for efficient fine-tuning in video tasks.
FE-Adapter can match or even surpass existing fine-tuning and video emotion models in both performance and efficiency.
arXiv Detail & Related papers (2024-08-05T12:27:28Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
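A minimal sketch of what "synthesizing a task-specific query with a learnable, lightweight module" could look like is a learnable query token that cross-attends over frozen backbone features; all names and sizes below are hypothetical, not the paper's design:

```python
import torch
import torch.nn as nn

class QueryReadout(nn.Module):
    """Hypothetical readout: a learnable task query over frozen features."""

    def __init__(self, dim: int = 768, num_classes: int = 1000):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))  # the only new tokens
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n, dim) features from a frozen backbone
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)           # cross-attention readout
        return self.head(pooled.squeeze(1))
```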
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
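A minimal sketch of this recipe, assuming a PyTorch setup: run the frozen backbone under no_grad so no autograd graph (and no activation memory) is kept for it, and backpropagate only through the lightweight parallel network. The function below is illustrative; the paper's actual side-network design is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def side_network_step(backbone: nn.Module, side: nn.Module, head: nn.Module,
                      video: torch.Tensor, labels: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One training step in which gradients never enter the frozen backbone."""
    with torch.no_grad():                 # frozen backbone: no autograd graph,
        feats = backbone(video)           # no activation memory kept for it
    logits = head(side(feats))            # only side + head are trainable
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                       # backprop touches side/head only
    optimizer.step()
    return loss.item()
```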
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video [15.952896909797728]
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks.
Recent research is shifting its focus toward parameter-efficient image-to-video adaptation.
We present a new adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks.
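"Zero-cost" adaptation of this kind is usually achieved by structural reparameterization: after training, a linear adapter can be folded into the adjacent frozen linear layer so inference carries no extra parameters or FLOPs. The sketch below shows the general folding trick under the assumption that the adapter composes linearly with the frozen layer; it is not ZeroI2V's exact formulation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_linear_adapter(frozen: nn.Linear, adapter: nn.Linear) -> nn.Linear:
    """Fold a trained linear adapter into the frozen layer it follows.

    If the adapted layer computes y = adapter(frozen(x)), the two linear
    maps compose into a single nn.Linear, so inference pays nothing extra.
    """
    merged = nn.Linear(frozen.in_features, adapter.out_features)
    merged.weight.copy_(adapter.weight @ frozen.weight)
    merged.bias.copy_(adapter.weight @ frozen.bias + adapter.bias)
    return merged
```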
arXiv Detail & Related papers (2023-10-02T16:41:20Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view, called visual-PETL (V-PETL), to investigate the different aspects affecting the trade-off between performance and parameter efficiency.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z) - Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
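For context, the simplest member of this family, generic visual prompt tuning, just prepends learnable tokens to the frozen model's input sequence; Pro-tuning's task-specific design is more involved than this sketch:

```python
import torch
import torch.nn as nn

def prepend_prompts(tokens: torch.Tensor, prompts: nn.Parameter) -> torch.Tensor:
    """Prepend learnable prompt tokens to a frozen model's input sequence.

    tokens: (batch, n, dim) embedded inputs; prompts: (num_prompts, dim).
    """
    p = prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
    return torch.cat([p, tokens], dim=1)  # only `prompts` receives gradients
```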
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.