TAda! Temporally-Adaptive Convolutions for Video Understanding
- URL: http://arxiv.org/abs/2110.06178v1
- Date: Tue, 12 Oct 2021 17:25:07 GMT
- Title: TAda! Temporally-Adaptive Convolutions for Video Understanding
- Authors: Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang,
Ziwei Liu, Marcelo H. Ang Jr
- Abstract summary: Adaptive weight calibration along the temporal dimension is an efficient way to facilitate the modelling of complex temporal dynamics in videos.
TAdaConv empowers the spatial convolutions with temporal modelling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
We construct TAda2D networks by replacing spatial convolutions in ResNet with TAdaConv, which leads to performance on par with or better than state-of-the-art approaches on multiple video action recognition and localization benchmarks.
- Score: 17.24510667917993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial convolutions are widely used in numerous deep video models. They
fundamentally assume spatio-temporal invariance, i.e., using shared weights
for every location in different frames. This work presents Temporally-Adaptive
Convolutions (TAdaConv) for video understanding, which shows that adaptive
weight calibration along the temporal dimension is an efficient way to
facilitate modelling complex temporal dynamics in videos. Specifically,
TAdaConv empowers the spatial convolutions with temporal modelling abilities by
calibrating the convolution weights for each frame according to its local and
global temporal context. Compared to previous temporal modelling operations,
TAdaConv is more efficient as it operates over the convolution kernels instead
of the features, whose dimension is an order of magnitude smaller than the
spatial resolutions. Further, the kernel calibration also brings an increased
model capacity. We construct TAda2D networks by replacing the spatial
convolutions in ResNet with TAdaConv, which leads to performance on par with
or better than state-of-the-art approaches on multiple video action
recognition and localization benchmarks. We also demonstrate that, as a
readily pluggable operation with negligible computation overhead, TAdaConv can
effectively improve many existing video models by a convincing margin. Codes
and models will be made available at
https://github.com/alibaba-mmai-research/pytorch-video-understanding.
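To make the kernel-calibration idea concrete, below is a minimal PyTorch sketch of a temporally-adaptive convolution: a shared spatial kernel is scaled per frame by a calibration factor computed from that frame's descriptor plus clip-level (global) context. The calibration branch shown here (pooled frame descriptors passed through a small temporal Conv1d) and the tensor layout are illustrative assumptions, not the exact design from the paper; see the linked repository for the official implementation.

```python
# Hedged sketch of per-frame weight calibration in the spirit of TAdaConv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConv2dSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        # Shared base kernel, as in a plain spatial convolution.
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Hypothetical calibration branch: local temporal context over frame
        # descriptors, producing one scale per output channel per frame.
        self.local = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (B, T, C, H, W) -- a clip of T frames.
        b, t, c, h, w = x.shape
        desc = x.mean(dim=(3, 4))                          # (B, T, C) frame descriptors
        desc = desc + desc.mean(dim=1, keepdim=True)       # add global temporal context
        alpha = self.local(desc.transpose(1, 2)).transpose(1, 2)  # (B, T, out_ch)
        alpha = 1.0 + torch.tanh(alpha)                    # calibration factors around 1
        # Per-frame kernels: scale the shared weight for every (batch, frame).
        w_t = alpha[..., None, None, None] * self.weight   # (B, T, out, in, k, k)
        # Grouped-convolution trick: fold (B, T) into groups so each frame
        # is convolved with its own calibrated kernel in one conv2d call.
        x = x.reshape(1, b * t * c, h, w)
        w_t = w_t.reshape(b * t * self.weight.shape[0], c, self.k, self.k)
        out = F.conv2d(x, w_t, padding=self.k // 2, groups=b * t)
        return out.reshape(b, t, -1, h, w)

clip = torch.randn(2, 8, 16, 32, 32)                       # (B, T, C, H, W)
print(TAdaConv2dSketch(16, 32)(clip).shape)                # torch.Size([2, 8, 32, 32, 32])
```

Note that the calibration only touches the kernels (out_ch values per frame), which is why the abstract describes it as cheaper than operating on the feature maps themselves.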
Related papers
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models [45.702473834294146]
Variational Autoencoders (VAE) play a crucial role in the spatio-temporal compression of videos within OpenAI's SORA and other latent generative video models.
Currently, there is no commonly used continuous video (3D) VAE for latent diffusion-based video models.
We propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE.
arXiv Detail & Related papers (2024-05-30T17:33:10Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - Temporally-Adaptive Models for Efficient Video Understanding [36.413570840293005]
This work shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos.
Specifically, TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
Compared to existing operations for temporal modeling, TAdaConv is more efficient as it operates over the convolution kernels instead of the features, whose dimension is an order of magnitude smaller than the spatial resolutions.
arXiv Detail & Related papers (2023-08-10T17:35:47Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression [25.96187914295921]
This paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies.
Our entropy model achieves an 18.2% bitrate saving on the UVG dataset compared with H.266 (VTM) using the highest compression ratio.
arXiv Detail & Related papers (2022-07-13T00:03:54Z) - Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug the SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
arXiv Detail & Related papers (2022-06-14T15:51:28Z) - Group Contextualization for Video Recognition [80.3842253625557]
Group contextualization (GC) can boost the performance of 2D-CNN (e.g., TSN) and TSM.
GC embeds feature with four different kinds of contexts in parallel.
Group contextualization can boost the performance of 2D-CNN (e.g., TSN) to a level comparable to the state-of-the-art video networks.
arXiv Detail & Related papers (2022-03-18T01:49:40Z) - VA-RED$^2$: Video Adaptive Redundancy Reduction [64.75692128294175]
We present a redundancy reduction framework, VA-RED$^2$, which is input-dependent.
We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism.
Our framework achieves a 20%-40% reduction in computation (FLOPs) compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-15T22:57:52Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map (a rough sketch of this idea appears after this list).
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z) - STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
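For the TAM entry above, here is a rough sketch of the dynamic temporal-kernel idea: a clip-level descriptor generates a per-channel 1D kernel that is applied depthwise along the temporal axis. The two-layer generator and the softmax normalization are assumptions made for illustration, not the authors' exact module.

```python
# Hedged sketch of video-specific temporal kernels in the spirit of TAM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAdaptiveSketch(nn.Module):
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.t_kernel = t_kernel
        # Hypothetical kernel generator: clip descriptor -> one temporal
        # kernel per channel, specific to each input video.
        self.gen = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels * t_kernel),
        )

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(2, 3, 4))                       # (B, C) clip descriptor
        kernels = self.gen(desc).reshape(b * c, 1, self.t_kernel)
        kernels = F.softmax(kernels, dim=-1)               # normalized temporal weights
        # Depthwise 1D convolution along time, one kernel per (video, channel):
        # spatial positions go to the batch dimension, (video, channel) to groups.
        seq = x.permute(3, 4, 0, 1, 2).reshape(h * w, b * c, t)
        out = F.conv1d(seq, kernels, padding=self.t_kernel // 2, groups=b * c)
        return out.reshape(h, w, b, c, t).permute(2, 3, 4, 0, 1)

clip = torch.randn(2, 16, 8, 14, 14)                       # (B, C, T, H, W)
print(TemporalAdaptiveSketch(16)(clip).shape)              # torch.Size([2, 16, 8, 14, 14])
```

Unlike the TAdaConv sketch above, which calibrates a spatial kernel per frame, this variant synthesizes a temporal kernel per video; both operate on kernels rather than features.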
This list is automatically generated from the titles and abstracts of the papers listed on this site.