Temporally-Adaptive Models for Efficient Video Understanding
- URL: http://arxiv.org/abs/2308.05787v1
- Date: Thu, 10 Aug 2023 17:35:47 GMT
- Title: Temporally-Adaptive Models for Efficient Video Understanding
- Authors: Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Yingya Zhang, Ziwei
Liu, Marcelo H. Ang Jr
- Abstract summary: This work shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos.
Specifically, TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
Compared to existing operations for temporal modeling, TAdaConv is more efficient as it operates over the convolution kernels instead of the features, whose dimension is an order of magnitude smaller than the spatial resolutions.
- Score: 36.413570840293005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial convolutions are extensively used in numerous deep video models. They
fundamentally assume spatio-temporal invariance, i.e., shared weights are used
for every location in different frames. This work presents Temporally-Adaptive
Convolutions (TAdaConv) for video understanding, which shows that adaptive
weight calibration along the temporal dimension is an efficient way to
facilitate modeling complex temporal dynamics in videos. Specifically, TAdaConv
empowers spatial convolutions with temporal modeling abilities by calibrating
the convolution weights for each frame according to its local and global
temporal context. Compared to existing operations for temporal modeling,
TAdaConv is more efficient as it operates over the convolution kernels instead
of the features, whose dimension is an order of magnitude smaller than the
spatial resolutions. Further, kernel calibration brings an increased model
capacity. Based on this readily plug-in operation, TAdaConv, and its extension,
TAdaConvV2, we construct TAdaBlocks to empower ConvNeXt and
Vision Transformer to have strong temporal modeling capabilities. Empirical
results show TAdaConvNeXtV2 and TAdaFormer perform competitively against
state-of-the-art convolutional and Transformer-based models in various video
understanding benchmarks. Our codes and models are released at:
https://github.com/alibaba-mmai-research/TAdaConv.
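As described in the abstract, TAdaConv keeps a shared spatial kernel and calibrates it per frame from local and global temporal context, so the extra computation scales with the kernel size rather than the spatial resolution. The snippet below is a minimal PyTorch sketch of that idea; the class name TAdaConv2dSketch, the calibration branch, and all shapes are illustrative assumptions rather than the released implementation (see https://github.com/alibaba-mmai-research/TAdaConv for the official code).

```python
# Minimal sketch of temporally-adaptive weight calibration.
# NOT the official TAdaConv implementation; layer choices and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConv2dSketch(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Shared base weight, as in an ordinary spatial convolution.
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.02)
        # Lightweight calibration branch: frame descriptors -> per-frame factors.
        # Local temporal context via a temporal conv; global context via mean pooling.
        self.local = nn.Conv1d(in_channels, in_channels, kernel_size=3, padding=1)
        self.to_alpha = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):  # x: (B, C_in, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))                                 # (B, C_in, T) frame descriptors
        ctx = self.local(desc) + desc.mean(dim=2, keepdim=True)   # local + global temporal context
        alpha = 1.0 + torch.tanh(self.to_alpha(ctx))              # (B, C_out, T) calibration factors
        out = []
        for ti in range(t):
            # Calibrate the shared kernel for this frame, then apply it as a grouped conv
            # so each sample in the batch uses its own calibrated kernel.
            w_t = self.weight.unsqueeze(0) * alpha[:, :, ti].view(b, -1, 1, 1, 1)
            frames = x[:, :, ti].reshape(1, b * c, h, w)
            w_flat = w_t.reshape(b * w_t.shape[1], c, *self.weight.shape[-2:])
            y = F.conv2d(frames, w_flat, padding=self.weight.shape[-1] // 2, groups=b)
            out.append(y.reshape(b, -1, h, w))
        return torch.stack(out, dim=2)                            # (B, C_out, T, H, W)
```

The point of the sketch is that the per-frame factors multiply the kernel (C_out x C_in x k x k entries) rather than the feature maps, which is why the abstract describes the overhead as an order of magnitude smaller than operating at the spatial resolution.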
Related papers
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z)
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models [45.702473834294146]
Variational Autoencoders (VAEs), used for spatio-temporal compression of videos, play a crucial role in OpenAI's SORA and numerous other video generative models.
Currently, there is no commonly used continuous video (3D) VAE for latent diffusion-based video models.
We propose a method for training a video VAE of latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE.
arXiv Detail & Related papers (2024-05-30T17:33:10Z)
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
arXiv Detail & Related papers (2022-06-14T15:51:28Z)
- TAda! Temporally-Adaptive Convolutions for Video Understanding [17.24510667917993]
Adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos.
TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
We construct TAda2D networks by replacing spatial convolutions in ResNet with TAdaConv, which leads to on par or better performance compared to state-of-the-art approaches on multiple video action recognition and localization benchmarks.
arXiv Detail & Related papers (2021-10-12T17:25:07Z)
- TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
- STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)