TAda! Temporally-Adaptive Convolutions for Video Understanding
- URL: http://arxiv.org/abs/2110.06178v1
- Date: Tue, 12 Oct 2021 17:25:07 GMT
- Title: TAda! Temporally-Adaptive Convolutions for Video Understanding
- Authors: Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang,
Ziwei Liu, Marcelo H. Ang Jr
- Abstract summary: Adaptive weight calibration along the temporal dimension is an efficient way to facilitate the modelling of complex temporal dynamics in videos.
TAdaConv empowers the spatial convolutions with temporal modelling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
We construct TAda2D networks by replacing spatial convolutions in ResNet with TAdaConv, which leads to performance on par with or better than state-of-the-art approaches on multiple video action recognition and localization benchmarks.
- Score: 17.24510667917993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial convolutions are widely used in numerous deep video models. They
fundamentally assume spatio-temporal invariance, i.e., using shared weights
for every location in different frames. This work presents Temporally-Adaptive
Convolutions (TAdaConv) for video understanding, which shows that adaptive
weight calibration along the temporal dimension is an efficient way to
facilitate modelling complex temporal dynamics in videos. Specifically,
TAdaConv empowers the spatial convolutions with temporal modelling abilities by
calibrating the convolution weights for each frame according to its local and
global temporal context. Compared to previous temporal modelling operations,
TAdaConv is more efficient as it operates over the convolution kernels instead
of the features, whose dimension is an order of magnitude smaller than the
spatial resolutions. Further, the kernel calibration also brings an increased
model capacity. We construct TAda2D networks by replacing the spatial
convolutions in ResNet with TAdaConv, which leads to performance on par with
or better than state-of-the-art approaches on multiple video action
recognition and localization benchmarks. We also demonstrate that, as a
readily pluggable operation with negligible computation overhead, TAdaConv can
effectively improve many existing video models by a convincing margin. Codes
and models will be made available at
https://github.com/alibaba-mmai-research/pytorch-video-understanding.
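To make the kernel-calibration idea concrete, below is a minimal PyTorch sketch of a temporally-adaptive convolution: a shared spatial kernel is scaled per frame by a calibration factor computed from that frame's descriptor plus clip-level (global) context. The calibration branch shown here (pooled frame descriptors passed through a small temporal Conv1d) and the tensor layout are illustrative assumptions, not the exact design from the paper; see the linked repository for the official implementation.

```python
# Hedged sketch of per-frame weight calibration in the spirit of TAdaConv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConv2dSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        # Shared base kernel, as in a plain spatial convolution.
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Hypothetical calibration branch: local temporal context over frame
        # descriptors, producing one scale per output channel per frame.
        self.local = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (B, T, C, H, W) -- a clip of T frames.
        b, t, c, h, w = x.shape
        desc = x.mean(dim=(3, 4))                          # (B, T, C) frame descriptors
        desc = desc + desc.mean(dim=1, keepdim=True)       # add global temporal context
        alpha = self.local(desc.transpose(1, 2)).transpose(1, 2)  # (B, T, out_ch)
        alpha = 1.0 + torch.tanh(alpha)                    # calibration factors around 1
        # Per-frame kernels: scale the shared weight for every (batch, frame).
        w_t = alpha[..., None, None, None] * self.weight   # (B, T, out, in, k, k)
        # Grouped-convolution trick: fold (B, T) into groups so each frame
        # is convolved with its own calibrated kernel in one conv2d call.
        x = x.reshape(1, b * t * c, h, w)
        w_t = w_t.reshape(b * t * self.weight.shape[0], c, self.k, self.k)
        out = F.conv2d(x, w_t, padding=self.k // 2, groups=b * t)
        return out.reshape(b, t, -1, h, w)

clip = torch.randn(2, 8, 16, 32, 32)                       # (B, T, C, H, W)
print(TAdaConv2dSketch(16, 32)(clip).shape)                # torch.Size([2, 8, 32, 32, 32])
```

Note that the calibration only touches the kernels (out_ch values per frame), which is why the abstract describes it as cheaper than operating on the feature maps themselves.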
Related papers
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models [45.702473834294146]
Variational Autoencoders (VAE) play a crucial role in the spatio-temporal compression of videos within OpenAI's SORA and other latent generative video models.
Currently, there is no commonly used continuous video (3D) VAE for latent diffusion-based video models.
We propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE.
arXiv Detail & Related papers (2024-05-30T17:33:10Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - Temporally-Adaptive Models for Efficient Video Understanding [36.413570840293005]
This work shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos.
Specifically, TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
Compared to existing operations for temporal modeling, TAdaConv is more efficient as it operates over the convolution kernels instead of the features, whose dimension is an order of magnitude smaller than the spatial resolutions.
arXiv Detail & Related papers (2023-08-10T17:35:47Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression [25.96187914295921]
This paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies.
Our entropy model achieves an 18.2% bitrate saving on the UVG dataset compared with H.266 (VTM) using the highest compression ratio.
arXiv Detail & Related papers (2022-07-13T00:03:54Z) - Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug the SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
arXiv Detail & Related papers (2022-06-14T15:51:28Z) - Group Contextualization for Video Recognition [80.3842253625557]
Group contextualization (GC) can boost the performance of 2D-CNN (e.g., TSN) and TSM.
GC embeds feature with four different kinds of contexts in parallel.
Group contextualization can boost the performance of 2D-CNN (e.g., TSN) to a level comparable to the state-of-the-art video networks.
arXiv Detail & Related papers (2022-03-18T01:49:40Z) - VA-RED$^2$: Video Adaptive Redundancy Reduction [64.75692128294175]
We present a redundancy reduction framework, VA-RED$^2$, which is input-dependent.
We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism.
Our framework achieves a 20%-40% reduction in computation (FLOPs) compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-15T22:57:52Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map (a rough sketch of this idea appears after this list).
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z) - STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
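For the TAM entry above, here is a rough sketch of the dynamic temporal-kernel idea: a clip-level descriptor generates a per-channel 1D kernel that is applied depthwise along the temporal axis. The two-layer generator and the softmax normalization are assumptions made for illustration, not the authors' exact module.

```python
# Hedged sketch of video-specific temporal kernels in the spirit of TAM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAdaptiveSketch(nn.Module):
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.t_kernel = t_kernel
        # Hypothetical kernel generator: clip descriptor -> one temporal
        # kernel per channel, specific to each input video.
        self.gen = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels * t_kernel),
        )

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(2, 3, 4))                       # (B, C) clip descriptor
        kernels = self.gen(desc).reshape(b * c, 1, self.t_kernel)
        kernels = F.softmax(kernels, dim=-1)               # normalized temporal weights
        # Depthwise 1D convolution along time, one kernel per (video, channel):
        # spatial positions go to the batch dimension, (video, channel) to groups.
        seq = x.permute(3, 4, 0, 1, 2).reshape(h * w, b * c, t)
        out = F.conv1d(seq, kernels, padding=self.t_kernel // 2, groups=b * c)
        return out.reshape(h, w, b, c, t).permute(2, 3, 4, 0, 1)

clip = torch.randn(2, 16, 8, 14, 14)                       # (B, C, T, H, W)
print(TemporalAdaptiveSketch(16)(clip).shape)              # torch.Size([2, 16, 8, 14, 14])
```

Unlike the TAdaConv sketch above, which calibrates a spatial kernel per frame, this variant synthesizes a temporal kernel per video; both operate on kernels rather than features.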
This list is automatically generated from the titles and abstracts of the papers listed on this site.