Multi-Level Temporal Pyramid Network for Action Detection
- URL: http://arxiv.org/abs/2008.03270v1
- Date: Fri, 7 Aug 2020 17:08:24 GMT
- Title: Multi-Level Temporal Pyramid Network for Action Detection
- Authors: Xiang Wang, Changxin Gao, Shiwei Zhang, and Nong Sang
- Abstract summary: We propose a Multi-Level Temporal Pyramid Network (MLTPN) to improve the discrimination of the features.
By this means, the proposed MLTPN can learn rich and discriminative features for different action instances with different durations.
We evaluate MLTPN on two challenging datasets: THUMOS'14 and ActivityNet v1.3, and the experimental results show that MLTPN obtains competitive performance on ActivityNet v1.3 and outperforms the state-of-the-art approaches on THUMOS'14 significantly.
- Score: 47.223376232616424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, one-stage frameworks have been widely applied for temporal action
detection, but they still suffer from the challenge that the action instances
span a wide range of time. The reason is that these one-stage detectors, e.g.,
the Single Shot MultiBox Detector (SSD), extract temporal features by applying only a
single-level layer for each head, which is not discriminative enough to perform
classification and regression. In this paper, we propose a Multi-Level Temporal
Pyramid Network (MLTPN) to improve the discrimination of the features.
Specifically, we first fuse the features from multiple layers with different
temporal resolutions, to encode multi-layer temporal information. We then apply
a multi-level feature pyramid architecture on the features to enhance their
discriminative abilities. Finally, we design a simple yet effective feature
fusion module to fuse the multi-level multi-scale features. By this means, the
proposed MLTPN can learn rich and discriminative features for different action
instances with different durations. We evaluate MLTPN on two challenging
datasets: THUMOS'14 and ActivityNet v1.3, and the experimental results show
that MLTPN obtains competitive performance on ActivityNet v1.3 and outperforms
the state-of-the-art approaches on THUMOS'14 significantly.
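The three steps described in the abstract (multi-layer fusion, a multi-level temporal feature pyramid, and a final fusion module) can be illustrated with a minimal PyTorch sketch. The class name, channel widths, and number of pyramid levels below are illustrative assumptions, not the authors' released implementation.
```python
# Minimal sketch of the MLTPN idea described in the abstract (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLTPNSketch(nn.Module):
    """Fuse multi-layer temporal features, build a temporal pyramid,
    and fuse the multi-level multi-scale outputs (illustrative only)."""

    def __init__(self, in_channels=(256, 512, 1024), mid_channels=256, num_levels=3):
        super().__init__()
        # Step 1: project each backbone layer to a common channel size so that
        # features with different temporal resolutions can be fused.
        self.lateral = nn.ModuleList(
            nn.Conv1d(c, mid_channels, kernel_size=1) for c in in_channels
        )
        # Step 2: a small multi-level pyramid built with strided temporal convolutions.
        self.pyramid = nn.ModuleList(
            nn.Conv1d(mid_channels, mid_channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels)
        )
        # Step 3: a simple fusion module over the concatenated pyramid levels.
        self.fuse = nn.Conv1d(mid_channels * (num_levels + 1), mid_channels, kernel_size=1)

    def forward(self, feats):
        # feats: list of (batch, C_i, T_i) tensors from different backbone layers.
        target_len = feats[0].shape[-1]
        # Fuse multi-layer features: align temporal lengths and sum.
        fused = sum(
            F.interpolate(lat(f), size=target_len, mode="linear", align_corners=False)
            for lat, f in zip(self.lateral, feats)
        )
        # Build the temporal pyramid on top of the fused feature.
        levels = [fused]
        x = fused
        for conv in self.pyramid:
            x = F.relu(conv(x))
            levels.append(x)
        # Fuse multi-level multi-scale features back at the finest resolution.
        upsampled = [
            F.interpolate(l, size=target_len, mode="linear", align_corners=False)
            for l in levels
        ]
        return self.fuse(torch.cat(upsampled, dim=1))


if __name__ == "__main__":
    feats = [torch.randn(2, 256, 128), torch.randn(2, 512, 64), torch.randn(2, 1024, 32)]
    out = MLTPNSketch()(feats)
    print(out.shape)  # torch.Size([2, 256, 128])
```
The classification and regression heads of the one-stage detector would then operate on such fused features rather than on a single-level layer.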
Related papers
- FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract the action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections (a rough sketch of the central-difference convolution idea appears after this list).
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
arXiv Detail & Related papers (2020-04-07T17:17:23Z)
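For the 3D-CDC entry above, a central-difference convolution blends a vanilla convolution with a term that aggregates differences against the center position. Below is a minimal PyTorch sketch of the generic spatio-temporal form, following the commonly used central-difference formulation; the class name, parameter defaults, and theta value are assumptions, and the gesture paper's actual 3D-CDC family also includes temporal-only variants.
```python
# Illustrative sketch of a 3D central-difference convolution (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CDC3DSketch(nn.Module):
    """3D convolution with a central-difference term:
    y = Conv3d(x) - theta * (sum of kernel weights applied at the center),
    so theta = 0 recovers a vanilla 3D convolution."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0.0:
            return out
        # The difference term is equivalent to a 1x1x1 convolution whose weight
        # is the sum of the full kernel over its temporal and spatial extent.
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        out_diff = F.conv3d(x, kernel_sum, stride=self.conv.stride, padding=0)
        return out - self.theta * out_diff


if __name__ == "__main__":
    clip = torch.randn(1, 3, 8, 32, 32)  # (batch, channels, time, height, width)
    print(CDC3DSketch(3, 16)(clip).shape)  # torch.Size([1, 16, 8, 32, 32])
```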
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.