Multi-Level Temporal Pyramid Network for Action Detection
- URL: http://arxiv.org/abs/2008.03270v1
- Date: Fri, 7 Aug 2020 17:08:24 GMT
- Title: Multi-Level Temporal Pyramid Network for Action Detection
- Authors: Xiang Wang, Changxin Gao, Shiwei Zhang, and Nong Sang
- Abstract summary: We propose a Multi-Level Temporal Pyramid Network (MLTPN) to improve the discrimination of the features.
By this means, the proposed MLTPN can learn rich and discriminative features for different action instances with different durations.
We evaluate MLTPN on two challenging datasets: THUMOS'14 and ActivityNet v1.3, and the experimental results show that MLTPN obtains competitive performance on ActivityNet v1.3 and outperforms the state-of-the-art approaches on THUMOS'14 significantly.
- Score: 47.223376232616424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, one-stage frameworks have been widely applied for temporal action
detection, but they still suffer from the challenge that the action instances
span a wide range of time. The reason is that these one-stage detectors, e.g.,
the Single Shot MultiBox Detector (SSD), extract temporal features by applying only a
single-level layer for each head, which is not discriminative enough to perform
classification and regression. In this paper, we propose a Multi-Level Temporal
Pyramid Network (MLTPN) to improve the discrimination of the features.
Specifically, we first fuse the features from multiple layers with different
temporal resolutions, to encode multi-layer temporal information. We then apply
a multi-level feature pyramid architecture on the features to enhance their
discriminative abilities. Finally, we design a simple yet effective feature
fusion module to fuse the multi-level multi-scale features. By this means, the
proposed MLTPN can learn rich and discriminative features for different action
instances with different durations. We evaluate MLTPN on two challenging
datasets: THUMOS'14 and ActivityNet v1.3, and the experimental results show
that MLTPN obtains competitive performance on ActivityNet v1.3 and outperforms
the state-of-the-art approaches on THUMOS'14 significantly.
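The three steps described in the abstract (multi-layer fusion, a multi-level temporal feature pyramid, and a final fusion module) can be illustrated with a minimal PyTorch sketch. The class name, channel widths, and number of pyramid levels below are illustrative assumptions, not the authors' released implementation.
```python
# Minimal sketch of the MLTPN idea described in the abstract (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLTPNSketch(nn.Module):
    """Fuse multi-layer temporal features, build a temporal pyramid,
    and fuse the multi-level multi-scale outputs (illustrative only)."""

    def __init__(self, in_channels=(256, 512, 1024), mid_channels=256, num_levels=3):
        super().__init__()
        # Step 1: project each backbone layer to a common channel size so that
        # features with different temporal resolutions can be fused.
        self.lateral = nn.ModuleList(
            nn.Conv1d(c, mid_channels, kernel_size=1) for c in in_channels
        )
        # Step 2: a small multi-level pyramid built with strided temporal convolutions.
        self.pyramid = nn.ModuleList(
            nn.Conv1d(mid_channels, mid_channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels)
        )
        # Step 3: a simple fusion module over the concatenated pyramid levels.
        self.fuse = nn.Conv1d(mid_channels * (num_levels + 1), mid_channels, kernel_size=1)

    def forward(self, feats):
        # feats: list of (batch, C_i, T_i) tensors from different backbone layers.
        target_len = feats[0].shape[-1]
        # Fuse multi-layer features: align temporal lengths and sum.
        fused = sum(
            F.interpolate(lat(f), size=target_len, mode="linear", align_corners=False)
            for lat, f in zip(self.lateral, feats)
        )
        # Build the temporal pyramid on top of the fused feature.
        levels = [fused]
        x = fused
        for conv in self.pyramid:
            x = F.relu(conv(x))
            levels.append(x)
        # Fuse multi-level multi-scale features back at the finest resolution.
        upsampled = [
            F.interpolate(l, size=target_len, mode="linear", align_corners=False)
            for l in levels
        ]
        return self.fuse(torch.cat(upsampled, dim=1))


if __name__ == "__main__":
    feats = [torch.randn(2, 256, 128), torch.randn(2, 512, 64), torch.randn(2, 1024, 32)]
    out = MLTPNSketch()(feats)
    print(out.shape)  # torch.Size([2, 256, 128])
```
The classification and regression heads of the one-stage detector would then operate on such fused features rather than on a single-level layer.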
Related papers
- FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract the action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections (a rough sketch of the central-difference convolution idea appears after this list).
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
arXiv Detail & Related papers (2020-04-07T17:17:23Z)
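For the 3D-CDC entry above, a central-difference convolution blends a vanilla convolution with a term that aggregates differences against the center position. Below is a minimal PyTorch sketch of the generic spatio-temporal form, following the commonly used central-difference formulation; the class name, parameter defaults, and theta value are assumptions, and the gesture paper's actual 3D-CDC family also includes temporal-only variants.
```python
# Illustrative sketch of a 3D central-difference convolution (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CDC3DSketch(nn.Module):
    """3D convolution with a central-difference term:
    y = Conv3d(x) - theta * (sum of kernel weights applied at the center),
    so theta = 0 recovers a vanilla 3D convolution."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0.0:
            return out
        # The difference term is equivalent to a 1x1x1 convolution whose weight
        # is the sum of the full kernel over its temporal and spatial extent.
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        out_diff = F.conv3d(x, kernel_sum, stride=self.conv.stride, padding=0)
        return out - self.theta * out_diff


if __name__ == "__main__":
    clip = torch.randn(1, 3, 8, 32, 32)  # (batch, channels, time, height, width)
    print(CDC3DSketch(3, 16)(clip).shape)  # torch.Size([1, 16, 8, 32, 32])
```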
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.