TAM: Temporal Adaptive Module for Video Recognition
- URL: http://arxiv.org/abs/2005.06803v3
- Date: Wed, 18 Aug 2021 12:19:06 GMT
- Title: TAM: Temporal Adaptive Module for Video Recognition
- Authors: Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, Tong Lu
- Abstract summary: The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
- Score: 60.83208364110288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video data exhibits complex temporal dynamics due to various factors such as
camera motion, speed variation, and different activities. To effectively
capture these diverse motion patterns, this paper presents a new temporal
adaptive module (TAM) that generates video-specific temporal kernels based on
its own feature map. TAM proposes a unique two-level adaptive modeling scheme
by decoupling the dynamic kernel into a location-sensitive importance map and
a location-invariant aggregation weight. The importance map is learned in a
local temporal window to capture short-term information, while the
aggregation weight is generated from a global view with a focus on long-term
structure. TAM is a modular block and can be integrated into 2D CNNs to yield
a powerful video architecture (TANet) with very small extra computational
cost. Extensive experiments on the Kinetics-400 and Something-Something
datasets demonstrate that TAM consistently outperforms other temporal modeling
methods and achieves state-of-the-art performance at similar complexity.
The code is available at https://github.com/liu-zhy/temporal-adaptive-module.
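As a rough illustration of the two-level scheme described in the abstract, the following is a minimal PyTorch sketch, not the repository's reference implementation. The class name, the (N, C, T, H, W) tensor layout, the reduction ratio, the layer widths, and the softmax-normalized kernel are all assumptions made for illustration: a local branch produces a location-sensitive importance map from a short temporal window, and a global branch produces a location-invariant, video-specific temporal kernel that is applied as a grouped temporal convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAdaptiveModuleSketch(nn.Module):
    """Illustrative two-level temporal adaptive block (not the official TAM code).

    Expects input of shape (N, C, T, H, W), with T == num_segments.
    """

    def __init__(self, channels, num_segments, kernel_size=3, reduction=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Local branch: location-sensitive importance map learned in a short
        # temporal window (Conv1d over T), used as a sigmoid gate.
        self.local_branch = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm1d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1, bias=False),
            nn.Sigmoid(),
        )
        # Global branch: location-invariant aggregation weight, i.e. a
        # video-specific temporal kernel of length kernel_size per channel,
        # generated from the whole temporal profile (global view).
        self.global_branch = nn.Sequential(
            nn.Linear(num_segments, 2 * num_segments, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(2 * num_segments, kernel_size, bias=False),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        n, c, t, h, w = x.shape
        # Squeeze spatial dimensions to obtain a per-channel temporal profile.
        profile = x.mean(dim=[3, 4])                              # (N, C, T)

        # Level 1: short-term, location-sensitive importance map.
        importance = self.local_branch(profile)                   # (N, C, T)
        x = x * importance.view(n, c, t, 1, 1)

        # Level 2: long-term, location-invariant temporal kernel per channel.
        kernel = self.global_branch(profile.reshape(n * c, t))    # (N*C, K)
        kernel = kernel.view(n * c, 1, self.kernel_size, 1)

        # Apply the video-specific kernel as a grouped temporal convolution.
        x = x.reshape(1, n * c, t, h * w)
        x = F.conv2d(x, kernel, padding=(self.kernel_size // 2, 0),
                     groups=n * c)
        return x.view(n, c, t, h, w)
```

Applied to, say, `torch.randn(2, 256, 8, 14, 14)` with `TemporalAdaptiveModuleSketch(256, num_segments=8)`, the block returns a tensor of the same shape, so it can be dropped into an existing 2D CNN block, which is roughly how the abstract describes building TANet from a 2D backbone.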
Related papers
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to use a Transformer to mine interactions among only a few selected foreground objects.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- TDN: Temporal Difference Networks for Efficient Action Recognition [31.922001043405924]
This paper presents a new video architecture, termed the Temporal Difference Network (TDN).
The core of our TDN is to devise an efficient temporal module (TDM) by explicitly leveraging a temporal difference operator (see the sketch after this list).
Our TDN presents a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset.
arXiv Detail & Related papers (2020-12-18T06:31:08Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted into temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- TEA: Temporal Excitation and Aggregation for Action Recognition [31.076707274791957]
We propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module.
For short-range motion modeling, the ME module calculates feature-level temporal differences from spatiotemporal features.
The MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture.
arXiv Detail & Related papers (2020-04-03T06:53:30Z)
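The TDN and TEA entries above both build on the same primitive: a feature-level temporal difference between adjacent frames. Below is a minimal, illustrative sketch of that primitive only, assuming an (N, C, T, H, W) feature layout; it is not either paper's actual module, and the function name and the zero-padding of the last time step are illustrative choices.

```python
import torch


def temporal_difference(features: torch.Tensor) -> torch.Tensor:
    """Feature-level temporal difference between adjacent frames.

    Args:
        features: tensor of shape (N, C, T, H, W).
    Returns:
        Tensor of the same shape where position t holds features[t+1] - features[t];
        the last time step is zero-padded so the temporal length is preserved.
    """
    diff = features[:, :, 1:] - features[:, :, :-1]   # (N, C, T-1, H, W)
    pad = torch.zeros_like(features[:, :, :1])        # (N, C, 1, H, W)
    return torch.cat([diff, pad], dim=2)               # (N, C, T, H, W)


# Example: an 8-frame clip of 64-channel feature maps.
x = torch.randn(2, 64, 8, 14, 14)
motion = temporal_difference(x)
assert motion.shape == x.shape
```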