TEA: Temporal Excitation and Aggregation for Action Recognition
- URL: http://arxiv.org/abs/2004.01398v1
- Date: Fri, 3 Apr 2020 06:53:30 GMT
- Title: TEA: Temporal Excitation and Aggregation for Action Recognition
- Authors: Yan Li and Bin Ji and Xintian Shi and Jianguo Zhang and Bin Kang and
Limin Wang
- Abstract summary: We propose a Temporal Excitation and Aggregation block, including a motion excitation module and a multiple temporal aggregation module.
For short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features.
The MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture.
- Score: 31.076707274791957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal modeling is key for action recognition in videos. It normally
considers both short-range motions and long-range aggregations. In this paper,
we propose a Temporal Excitation and Aggregation (TEA) block, including a
motion excitation (ME) module and a multiple temporal aggregation (MTA) module,
specifically designed to capture both short- and long-range temporal evolution.
In particular, for short-range motion modeling, the ME module calculates the
feature-level temporal differences from spatiotemporal features. It then
utilizes the differences to excite the motion-sensitive channels of the
features. The long-range temporal aggregations in previous works are typically
achieved by stacking a large number of local temporal convolutions. Each
convolution processes a local temporal window at a time. In contrast, the MTA
module proposes to deform the local convolution to a group of sub-convolutions,
forming a hierarchical residual architecture. Without introducing additional
parameters, the features will be processed with a series of sub-convolutions,
and each frame could complete multiple temporal aggregations with
neighborhoods. The final equivalent receptive field of temporal dimension is
accordingly enlarged, which is capable of modeling the long-range temporal
relationship over distant frames. The two components of the TEA block are
complementary in temporal modeling. Finally, our approach achieves impressive
results at low FLOPs on several action recognition benchmarks, such as
Kinetics, Something-Something, HMDB51, and UCF101, which confirms its
effectiveness and efficiency.
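To make the motion excitation idea described in the abstract more concrete, here is a minimal PyTorch-style sketch. It assumes an input laid out as [batch, time, channels, height, width]; the reduction ratio and layer choices are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a motion-excitation (ME) style block, assuming PyTorch and
# a [batch, time, channels, height, width] input. Reduction ratio and layer
# choices are illustrative assumptions, not the paper's exact code.
import torch
import torch.nn as nn


class MotionExcitationSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        reduced = max(channels // reduction, 1)
        # Squeeze channels before computing temporal differences.
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        # Spatially transform the next frame before subtraction.
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1,
                                   groups=reduced, bias=False)
        # Expand back to the original channel count for the attention weights.
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        feat = self.squeeze(x.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        # Feature-level temporal difference: transform(frame t+1) - frame t.
        nxt = self.transform(feat[:, 1:].reshape(-1, feat.size(2), h, w))
        nxt = nxt.reshape(n, t - 1, -1, h, w)
        diff = nxt - feat[:, :-1]
        # Pad the last step with zeros so the temporal length is preserved.
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        # Spatial pooling + channel expansion + sigmoid -> motion attention.
        attn = self.pool(diff.reshape(n * t, -1, h, w))
        attn = torch.sigmoid(self.expand(attn)).reshape(n, t, c, 1, 1)
        # Excite motion-sensitive channels; keep a residual path.
        return x + x * attn
```

The sigmoid attention derived from frame-to-frame differences re-weights channels, so channels that respond to motion are amplified while static appearance is preserved through the residual path.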
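The multiple temporal aggregation idea, deforming one local temporal convolution into a hierarchy of sub-convolutions, can be sketched in the same spirit; the number of groups and the kernel sizes below are assumptions chosen for illustration.

```python
# Minimal sketch of a multiple-temporal-aggregation (MTA) style block, assuming
# PyTorch and a [batch, time, channels, height, width] layout. Group count and
# kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class MultipleTemporalAggregationSketch(nn.Module):
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups = groups
        width = channels // groups
        # One temporal (1D) + spatial (3x3) sub-convolution per group after the first.
        self.temporal_convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size=3, padding=1,
                      groups=width, bias=False)
            for _ in range(groups - 1)
        )
        self.spatial_convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(groups - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        width = c // self.groups
        splits = torch.chunk(x, self.groups, dim=2)  # split along channels
        outputs = [splits[0]]                        # first group: identity
        prev = splits[0]
        for i in range(1, self.groups):
            # Hierarchical residual: feed the previous group's output forward,
            # so later groups see an ever larger temporal receptive field.
            feat = splits[i] + prev
            # Channel-wise temporal 1D convolution over the time axis.
            tfeat = feat.permute(0, 3, 4, 2, 1).reshape(-1, width, t)
            tfeat = self.temporal_convs[i - 1](tfeat)
            tfeat = tfeat.reshape(n, h, w, width, t).permute(0, 4, 3, 1, 2)
            # Spatial 3x3 convolution applied frame by frame.
            sfeat = self.spatial_convs[i - 1](tfeat.reshape(n * t, width, h, w))
            prev = sfeat.reshape(n, t, width, h, w)
            outputs.append(prev)
        return torch.cat(outputs, dim=2)
```

Because each group's output is added into the next group's input, the last group in this sketch aggregates over roughly 2*(groups-1)+1 frames even though every sub-convolution uses a kernel of size 3, which is how the equivalent temporal receptive field grows without extra parameters.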
Related papers
- A Decoupled Spatio-Temporal Framework for Skeleton-based Action
Segmentation [89.86345494602642]
Existing methods are limited by weak temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z) - Revisiting the Spatial and Temporal Modeling for Few-shot Action
Recognition [16.287968292213563]
We propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner.
We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51.
arXiv Detail & Related papers (2023-01-19T08:34:04Z) - FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves the state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - Long-Short Temporal Modeling for Efficient Action Recognition [32.159784061961886]
We propose a new two-stream action recognition network, termed as MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module.
For short-term motions, we design an efficient ME module to enhance the short-term motions by mingling the motion saliency among neighboring segments.
As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.
arXiv Detail & Related papers (2021-06-30T02:54:13Z) - TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
arXiv Detail & Related papers (2021-06-02T11:43:49Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting temporal features at multiple resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - PAN: Towards Fast Action Recognition via Learning Persistence of
Appearance [60.75488333935592]
Most state-of-the-art methods heavily rely on dense optical flow as motion representation.
In this paper, we shed light on fast action recognition by lifting the reliance on optical flow.
We design a novel motion cue called Persistence of Appearance (PA).
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
arXiv Detail & Related papers (2020-08-08T07:09:54Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)