An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
- URL: http://arxiv.org/abs/2207.10448v1
- Date: Thu, 21 Jul 2022 12:38:05 GMT
- Title: An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
- Authors: Yuetian Weng, Zizheng Pan, Mingfei Han, Xiaojun Chang, Bohan Zhuang
- Abstract summary: We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention to capture long-term space-time dependencies in the later stages.
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
- Score: 40.68615998427292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of action detection aims at deducing both the action category and
localization of the start and end moment for each action instance in a long,
untrimmed video. While vision Transformers have driven the recent advances in
video understanding, it is non-trivial to design an efficient architecture for
action detection due to the prohibitively expensive self-attentions over a long
sequence of video clips. To this end, we present an efficient hierarchical
Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon
the fact that the early self-attention layers in Transformers still focus on
local patterns. Specifically, we propose to use local window attention to
encode rich local spatio-temporal representations in the early stages while
applying global attention modules to capture long-term space-time dependencies
in the later stages. In this way, our STPT can encode both locality and
dependency with largely reduced redundancy, delivering a promising trade-off
between accuracy and efficiency. For example, with only RGB input, the proposed
STPT achieves 53.6% mAP on THUMOS14, surpassing I3D+AFSD RGB model by over 10%
and performing favorably against state-of-the-art AFSD that uses additional
flow features with 31% fewer GFLOPs, which serves as an effective and efficient
end-to-end Transformer-based framework for action detection.
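To make the local-then-global design concrete, here is a minimal PyTorch sketch of the idea: attention restricted to non-overlapping spatio-temporal windows in the early stages, and full attention over all tokens in the later stages. The module names, window size, and dimensions are illustrative assumptions, and the sketch omits the patch embedding, feed-forward blocks, and pyramidal downsampling of the actual STPT.

```python
# Minimal sketch of the local-then-global attention scheme described in the
# abstract. Names, window sizes, and stage counts are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatio-temporal windows (early stages)."""

    def __init__(self, dim, heads=4, window=(2, 4, 4)):
        super().__init__()
        self.window = window  # (t, h, w) window size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) grid of video tokens
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition the grid into non-overlapping windows of wt*wh*ww tokens.
        x = x.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        out, _ = self.attn(x, x, x)  # attention only within each window
        # Undo the window partition.
        out = out.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


class GlobalAttention(nn.Module):
    """Full self-attention over all spatio-temporal tokens (later stages)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, T, H, W, C = x.shape
        tokens = x.reshape(B, T * H * W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, T, H, W, C)


class TinySTPT(nn.Module):
    """Two local stages followed by two global stages, with pre-norm residuals."""

    def __init__(self, dim=64):
        super().__init__()
        self.stages = nn.ModuleList([
            LocalWindowAttention(dim), LocalWindowAttention(dim),
            GlobalAttention(dim), GlobalAttention(dim),
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in self.stages])

    def forward(self, x):
        for norm, stage in zip(self.norms, self.stages):
            x = x + stage(norm(x))
        return x


if __name__ == "__main__":
    video_tokens = torch.randn(2, 8, 8, 8, 64)  # (B, T, H, W, C)
    print(TinySTPT(64)(video_tokens).shape)     # torch.Size([2, 8, 8, 8, 64])
```

Restricting early attention to windows cuts the attention cost from O((T·H·W)^2) to O(T·H·W · wt·wh·ww), which is the source of the efficiency gain highlighted in the abstract.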
Related papers
- Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving [69.20604395205248]
We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving.
We introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector.
We conduct extensive experiments on the KITTI and nuScenes datasets, and the results demonstrate the state-of-the-art performance of Shift-SSD.
arXiv Detail & Related papers (2024-03-10T10:36:32Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from the pain of huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT) that leverages self-attention computation in points to gather global context features.
Our method fully exploits the potential of the Transformer architecture, paving the way to efficient and accurate recognition results.
arXiv Detail & Related papers (2021-08-13T06:07:57Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and differences between adjacent features (a rough sketch follows this entry).
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
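As a rough illustration of the adaptive-graph idea in the entry above, the sketch below scores edges to the two neighbouring snippets from feature differences and relative position, then aggregates them with a self loop. The layer name, edge-scoring function, and wrap-around boundary handling are assumptions made for illustration, not the ATAG implementation.

```python
# Loose sketch of a graph layer over temporal snippet features with
# data-dependent edge weights between adjacent snippets. Illustrative only;
# not the adaptive GCN from the ATAG paper.
import torch
import torch.nn as nn


class AdjacentDifferenceGCN(nn.Module):
    """Aggregates each snippet with its two temporal neighbours, weighting the
    edges by the feature difference and the (signed) relative position."""

    def __init__(self, dim):
        super().__init__()
        self.edge_score = nn.Linear(dim + 1, 1)  # difference + relative position -> edge weight
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, C) snippet-level features; boundaries wrap around for simplicity.
        B, T, _ = x.shape
        left = torch.roll(x, shifts=1, dims=1)    # features of snippet t-1
        right = torch.roll(x, shifts=-1, dims=1)  # features of snippet t+1
        pos = torch.ones(B, T, 1, device=x.device)
        w_left = torch.sigmoid(self.edge_score(torch.cat([x - left, -pos], dim=-1)))
        w_right = torch.sigmoid(self.edge_score(torch.cat([x - right, pos], dim=-1)))
        # Message passing: self loop plus the two weighted neighbours.
        return torch.relu(self.proj(x + w_left * left + w_right * right))


if __name__ == "__main__":
    snippet_feats = torch.randn(2, 100, 256)  # 100 snippets, 256-d features
    print(AdjacentDifferenceGCN(256)(snippet_feats).shape)  # torch.Size([2, 100, 256])
```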
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution (see the sketch after this entry).
Our model outperforms prior methods on the larger and more challenging RWF-2000 dataset by more than a 2% margin in accuracy.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
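The SepConvLSTM construction described above swaps each gate convolution of a ConvLSTM for a depthwise separable convolution. Below is a minimal sketch of one such cell, assuming the standard ConvLSTM gate equations; the fused gate convolution, kernel size, and dimensions are illustrative choices rather than the authors' code.

```python
# Sketch of a ConvLSTM cell whose gate convolutions are depthwise separable.
# Gate formulation and hyper-parameters are simplifying assumptions.
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class SepConvLSTMCell(nn.Module):
    """ConvLSTM cell computing all four gates with a single separable convolution."""

    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        self.hidden_ch = hidden_ch
        # One separable conv produces the input, forget, cell, and output gates.
        self.gates = SeparableConv2d(in_ch + hidden_ch, 4 * hidden_ch)

    def forward(self, x, state):
        h, c = state  # hidden and cell states, each (B, hidden_ch, H, W)
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update the cell state
        h = o * torch.tanh(c)           # emit the new hidden state
        return h, c


if __name__ == "__main__":
    cell = SepConvLSTMCell(in_ch=3, hidden_ch=16)
    frame = torch.randn(2, 3, 32, 32)            # one RGB frame batch
    h = c = torch.zeros(2, 16, 32, 32)           # initial states
    h, c = cell(frame, (h, c))
    print(h.shape)  # torch.Size([2, 16, 32, 32])
```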
- Relaxed Transformer Decoders for Direct Action Proposal Generation [30.516462193231888]
This paper presents a simple and end-to-end learnable framework (RTD-Net) for direct action proposal generation.
To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR).
Experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net.
arXiv Detail & Related papers (2021-02-03T06:29:28Z)
- Actions as Moving Points [66.21507857877756]
We present a conceptually simple, efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector).
Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches.
Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets.
arXiv Detail & Related papers (2020-01-14T03:29:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.