Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning
- URL: http://arxiv.org/abs/2309.07911v1
- Date: Thu, 14 Sep 2023 17:58:33 GMT
- Title: Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning
- Authors: Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao,
Deli Zhao, Nong Sang
- Abstract summary: We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing margins.
- Score: 59.26623999209235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale pre-trained language-image models like CLIP have shown
extraordinary capabilities for understanding spatial contents, but naively
transferring such models to video recognition still suffers from unsatisfactory
temporal modeling capabilities. Existing methods insert tunable structures into
or in parallel with the pre-trained model, which either requires
back-propagation through the whole pre-trained model and is thus
resource-demanding, or is limited by the temporal reasoning capability of the
pre-trained structure. In this work, we present DiST, which disentangles the
learning of spatial and temporal aspects of videos. Specifically, DiST uses a
dual-encoder structure, where a pre-trained foundation model acts as the
spatial encoder, and a lightweight network is introduced as the temporal
encoder. An integration branch is inserted between the encoders to fuse
spatio-temporal information. The disentangled spatial and temporal learning in
DiST is highly efficient because it avoids back-propagation through the massive
pre-trained parameters. Meanwhile, we empirically show that disentangled
learning with an extra network for integration benefits both spatial and
temporal understanding. Extensive experiments on five benchmarks show that DiST
delivers better performance than existing state-of-the-art methods by
convincing margins. When pre-training on the large-scale Kinetics-710, we achieve
89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability
of DiST. Code and models can be found at
https://github.com/alibaba-mmai-research/DiST.
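The dual-encoder design described in the abstract lends itself to a short sketch. The code below is a minimal, hedged illustration rather than the authors' implementation: the names TemporalEncoder and DualEncoderVideoModel, the 1D-convolution temporal blocks, the concatenation-based integration branch, and the mean pooling over time are all assumptions made for clarity. What it does reflect faithfully is the structural point of the abstract: the pre-trained spatial encoder stays frozen, so gradients only reach the lightweight temporal and integration parameters.

```python
# Minimal sketch of the dual-encoder structure: a frozen pre-trained image
# model as the spatial encoder, a small trainable temporal encoder, and an
# integration branch fusing the two. All module/dimension choices here are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """Lightweight temporal encoder: 1D convolutions over per-frame features."""

    def __init__(self, dim: int = 256, depth: int = 2):
        super().__init__()
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.GELU())
            for _ in range(depth)
        ])

    def forward(self, x):              # x: (B, T, C)
        x = x.transpose(1, 2)          # (B, C, T) for Conv1d
        x = self.blocks(x)
        return x.transpose(1, 2)       # back to (B, T, C)


class DualEncoderVideoModel(nn.Module):
    def __init__(self, spatial_encoder: nn.Module, feat_dim: int,
                 temporal_dim: int = 256, num_classes: int = 400):
        super().__init__()
        # Frozen spatial encoder (e.g. a CLIP image backbone): no gradients
        # are propagated through its parameters.
        self.spatial_encoder = spatial_encoder
        for p in self.spatial_encoder.parameters():
            p.requires_grad = False

        self.proj = nn.Linear(feat_dim, temporal_dim)
        self.temporal_encoder = TemporalEncoder(temporal_dim)
        # Integration branch: fuses per-frame spatial features with temporal
        # features (here a simple concat + MLP, purely for illustration).
        self.integration = nn.Sequential(
            nn.Linear(feat_dim + temporal_dim, temporal_dim), nn.GELU())
        self.head = nn.Linear(temporal_dim, num_classes)

    def forward(self, video):                          # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        frames = video.flatten(0, 1)                   # (B*T, 3, H, W)
        with torch.no_grad():                          # spatial encoder stays frozen
            spatial = self.spatial_encoder(frames)     # assumed (B*T, feat_dim)
        spatial = spatial.view(B, T, -1)               # (B, T, feat_dim)

        temporal = self.temporal_encoder(self.proj(spatial))   # (B, T, temporal_dim)
        fused = self.integration(torch.cat([spatial, temporal], dim=-1))
        return self.head(fused.mean(dim=1))            # temporal average -> logits
```

For example, plugging a frozen CLIP ViT image backbone in as spatial_encoder leaves only the projection, temporal encoder, integration branch, and classification head trainable, which is the efficiency property the abstract emphasizes.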
Related papers
- D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition [60.84084172829169]
Adapting large pre-trained image models to few-shot action recognition has proven to be an effective strategy for learning robust feature extractors.
We present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel tuning framework well-suited for few-shot action recognition.
arXiv Detail & Related papers (2023-12-03T15:40:10Z) - Orthogonal Temporal Interpolation for Zero-Shot Video Recognition [45.53856045374685]
Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during the model training process.
Recent vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR.
arXiv Detail & Related papers (2023-08-14T02:26:49Z) - OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive
Learning [67.07363529640784]
We propose OpenSTL to categorize prevalent approaches into recurrent-based and recurrent-free models.
We conduct standard evaluations on datasets across various domains, including synthetic moving object trajectory, human motion, driving scenes, traffic flow, and weather forecasting.
We find that recurrent-free models achieve a better balance between efficiency and performance than recurrent models.
arXiv Detail & Related papers (2023-06-20T03:02:14Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z) - Learning Fine-Grained Visual Understanding for Video Question Answering
via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image-language encoder and a video-language encoder to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z) - Shifted Chunk Transformer for Spatio-Temporal Representational Learning [24.361059477031162]
We construct a shifted chunk Transformer with pure self-attention blocks.
This Transformer can learn hierarchical spatio-temporal features from a tiny patch to a global video clip.
It outperforms state-of-the-art approaches on Kinetics, Kinetics-600, UCF101, and HMDB51.
arXiv Detail & Related papers (2021-08-26T04:34:33Z) - Adaptive Machine Learning for Time-Varying Systems: Low Dimensional
Latent Space Tuning [91.3755431537592]
We present a recently developed method of adaptive machine learning for time-varying systems.
Our approach is to map very high (N>100k) dimensional inputs into a low-dimensional (N~2) latent space at the output of the encoder section of an encoder-decoder CNN; a minimal sketch of this bottleneck follows the list below.
This method allows us to learn correlations within the data and to track their evolution in real time based on feedback, without interruptions.
arXiv Detail & Related papers (2021-07-13T16:05:28Z) - Gradient Forward-Propagation for Large-Scale Temporal Video Modelling [13.665160620951777]
Backpropagation blocks computations until the forward and backward passes are completed.
For temporal signals, this introduces high latency and hinders real-time learning.
In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time.
We show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training.
arXiv Detail & Related papers (2021-06-15T17:50:22Z)
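For the adaptive latent-space tuning entry above, the central architectural idea is an encoder-decoder CNN whose encoder compresses a very high-dimensional input into a roughly two-dimensional latent that can then be tracked over time. The following is a minimal, generic bottleneck autoencoder sketch under that reading; the input resolution, channel counts, and layer choices are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of an encoder-decoder CNN with a two-dimensional latent
# bottleneck; resolution, channels, and layer choices are assumptions.
import torch
import torch.nn as nn


class BottleneckAutoencoder(nn.Module):
    def __init__(self, in_channels: int = 1, latent_dim: int = 2, img_size: int = 64):
        super().__init__()
        flat = 32 * (img_size // 4) ** 2
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 4, stride=2, padding=1),   # img_size -> img_size/2
            nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),            # img_size/2 -> img_size/4
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(flat, latent_dim),                          # very low-dimensional latent
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, flat),
            nn.Unflatten(1, (32, img_size // 4, img_size // 4)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, in_channels, 4, stride=2, padding=1),
        )

    def forward(self, x):              # x: (B, in_channels, img_size, img_size)
        z = self.encoder(x)            # (B, latent_dim): compact state tracked over time
        return self.decoder(z), z
```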
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.