Temporal Pyramid Network for Action Recognition
- URL: http://arxiv.org/abs/2004.03548v2
- Date: Mon, 15 Jun 2020 02:05:13 GMT
- Title: Temporal Pyramid Network for Action Recognition
- Authors: Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, Bolei Zhou
- Abstract summary: We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
- Score: 129.12076009042622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual tempo characterizes the dynamics and the temporal scale of an action.
Modeling such visual tempos of different actions facilitates their recognition.
Previous works often capture the visual tempo through sampling raw videos at
multiple rates and constructing an input-level frame pyramid, which usually
requires a costly multi-branch network to handle. In this work we propose a
generic Temporal Pyramid Network (TPN) at the feature-level, which can be
flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner.
Two essential components of TPN, the source of features and the fusion of
features, form a feature hierarchy for the backbone so that it can capture
action instances at various tempos. TPN also shows consistent improvements over
other challenging baselines on several action recognition datasets.
Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling
obtains a 2% gain on the validation set of Kinetics-400. A further analysis
also reveals that TPN gains most of its improvements on action classes that
have large variances in their visual tempos, validating the effectiveness of
TPN.
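The two components named in the abstract, the source of features and their fusion, map naturally onto a small feature-pyramid module. Below is a minimal PyTorch-style sketch of that idea; the module name, channel widths, and pooling rates are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPyramidSketch(nn.Module):
    """Illustrative feature-level temporal pyramid (not the TPN release).

    Takes per-stage backbone features of shape (N, C_i, T, H, W),
    projects them to a common width, pools each level to a different
    temporal rate, and fuses the levels top-down along the time axis.
    """

    def __init__(self, in_channels=(512, 1024, 2048), width=256):
        super().__init__()
        # Source of features: one lateral projection per backbone stage.
        self.lateral = nn.ModuleList(
            nn.Conv3d(c, width, kernel_size=1) for c in in_channels
        )
        # Deeper levels get coarser temporal rates: stride 1, 2, 4, ...
        self.pools = nn.ModuleList(
            nn.MaxPool3d((2 ** i, 1, 1), stride=(2 ** i, 1, 1))
            if i > 0 else nn.Identity()
            for i in range(len(in_channels))
        )

    def forward(self, feats):
        levels = [l(f) for l, f in zip(self.lateral, feats)]
        levels = [p(x) for p, x in zip(self.pools, levels)]
        # Fusion of features: upsample the coarser level in time and add.
        for i in range(len(levels) - 2, -1, -1):
            up = F.interpolate(levels[i + 1], size=levels[i].shape[2:],
                               mode="trilinear", align_corners=False)
            levels[i] = levels[i] + up
        return levels

# Toy usage: three backbone stages, 8 frames each.
feats = [torch.randn(2, c, 8, 14, 14) for c in (512, 1024, 2048)]
outs = TemporalPyramidSketch()(feats)  # multi-rate, fused feature levels
```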
Related papers
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving classification capacity on multivariate time series classification tasks.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
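As a rough, assumption-heavy reading of merits (1)-(3): interleaving strided convolutions with self-attention lets attention run on progressively shorter sequences, which both mixes the two architecture families and curbs the quadratic attention cost. The sketch below illustrates this pattern; it is not FormerTime's actual architecture.

```python
import torch
import torch.nn as nn

class HierarchicalTSEncoder(nn.Module):
    """Sketch of a hierarchical multi-scale time-series encoder.

    Each stage shortens the sequence with a strided 1D convolution,
    then models it with self-attention, so attention operates on a
    coarser (and cheaper) temporal scale at every stage.
    """

    def __init__(self, in_dim, widths=(64, 128, 256), stride=2, heads=4):
        super().__init__()
        dims = (in_dim,) + widths
        self.stages = nn.ModuleList(
            nn.ModuleDict({
                "down": nn.Conv1d(c_in, c_out, kernel_size=3,
                                  stride=stride, padding=1),
                "attn": nn.TransformerEncoderLayer(
                    d_model=c_out, nhead=heads, batch_first=True),
            })
            for c_in, c_out in zip(dims[:-1], dims[1:])
        )

    def forward(self, x):                        # x: (batch, vars, time)
        scales = []
        for stage in self.stages:
            x = stage["down"](x)                  # shorten the sequence
            h = stage["attn"](x.transpose(1, 2))  # attend at this scale
            scales.append(h.mean(dim=1))          # per-scale summary
            x = h.transpose(1, 2)
        return torch.cat(scales, dim=-1)          # multi-scale feature

z = HierarchicalTSEncoder(12)(torch.randn(4, 12, 128))  # (4, 448)
```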
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity Recognition [2.7677069267434873]
Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community.
We propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective benchmark datasets.
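The summary does not detail the fusion module itself; a generic weighted fusion over sparsely sampled frame features could look like the sketch below, where the per-frame softmax scoring is an assumption rather than the paper's exact SWTF design.

```python
import torch
import torch.nn as nn

class SparseWeightedFusion(nn.Module):
    """Sketch: softmax-weighted fusion of sparsely sampled frame features."""

    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # learned frame importance

    def forward(self, frame_feats):                # (batch, frames, feat_dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)
        return (weights * frame_feats).sum(dim=1)  # (batch, feat_dim)

# Fuse features of 8 sparsely sampled frames per clip.
clip_feat = SparseWeightedFusion(512)(torch.randn(4, 8, 512))
```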
arXiv Detail & Related papers (2022-11-10T12:45:43Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts the action visual tempo from low-level backbone features at a single layer.
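The summary leaves TCM's internals open; as a hedged sketch, a single-layer temporal correlation can be as simple as channel-wise cosine similarity between consecutive frames of one low-level feature map, which already yields a crude per-location measure of how fast appearance changes, i.e. visual tempo.

```python
import torch
import torch.nn.functional as F

def temporal_correlation(feats, eps=1e-6):
    """Sketch (assumed form, simpler than the actual TCM).

    feats: (N, C, T, H, W) low-level backbone features. Returns
    (N, T-1, H, W) cosine similarities between consecutive frames.
    """
    a = F.normalize(feats[:, :, :-1], dim=1, eps=eps)  # frames t
    b = F.normalize(feats[:, :, 1:], dim=1, eps=eps)   # frames t+1
    return (a * b).sum(dim=1)              # correlate over channels

corr = temporal_correlation(torch.randn(2, 64, 8, 28, 28))  # (2, 7, 28, 28)
```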
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
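One common way to realize such a block, sketched here with assumed kernel sizes, is to run parallel 3D convolutions with different temporal extents and concatenate their outputs, which keeps the block drop-in for any 3D-CNN stage.

```python
import torch
import torch.nn as nn

class MultiTemporalConv(nn.Module):
    """Sketch: parallel 3D convolutions at several temporal extents."""

    def __init__(self, c_in, c_out, t_kernels=(1, 3, 5)):
        super().__init__()
        assert c_out % len(t_kernels) == 0
        branch_c = c_out // len(t_kernels)
        self.branches = nn.ModuleList(
            nn.Conv3d(c_in, branch_c, kernel_size=(k, 3, 3),
                      padding=(k // 2, 1, 1))      # preserve (T, H, W)
            for k in t_kernels
        )

    def forward(self, x):                          # (N, C, T, H, W)
        return torch.cat([b(x) for b in self.branches], dim=1)

y = MultiTemporalConv(64, 96)(torch.randn(2, 64, 8, 28, 28))  # (2, 96, 8, 28, 28)
```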
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
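The 3D-CDC family builds on central difference convolution. Using the standard decomposition from the CDC literature (a vanilla convolution minus a theta-weighted central-difference term), a 3D variant can be sketched as follows; the fixed kernel size and theta value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    """Sketch of a 3D central difference convolution.

    y(p0) = sum_n w(p_n) x(p0 + p_n) - theta * x(p0) * sum_n w(p_n),
    i.e. a vanilla convolution blended with a central-difference term.
    """

    def __init__(self, c_in, c_out, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0:
            return out
        # A 1x1x1 convolution with the kernel summed over its support
        # implements the "- theta * x(p0) * sum(w)" difference term.
        w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        return out - self.theta * F.conv3d(x, w_sum)

y = CDC3d(16, 32)(torch.randn(2, 16, 8, 28, 28))  # (2, 32, 8, 28, 28)
```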
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Multi-Level Temporal Pyramid Network for Action Detection [47.223376232616424]
We propose a Multi-Level Temporal Pyramid Network (MLTPN) to improve the discrimination of the features.
By this means, the proposed MLTPN can learn rich and discriminative features for action instances of different durations.
We evaluate MLTPN on two challenging datasets, THUMOS'14 and ActivityNet v1.3; the experimental results show that MLTPN obtains competitive performance on ActivityNet v1.3 and significantly outperforms state-of-the-art approaches on THUMOS'14.
arXiv Detail & Related papers (2020-08-07T17:08:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.