Two-Stream AMTnet for Action Detection
- URL: http://arxiv.org/abs/2004.01494v1
- Date: Fri, 3 Apr 2020 12:16:45 GMT
- Title: Two-Stream AMTnet for Action Detection
- Authors: Suman Saha, Gurkirt Singh and Fabio Cuzzolin
- Abstract summary: We propose a new deep neural network architecture for online action detection, termed Two-Stream AMTnet, which adds a parallel motion stream to the original appearance one in AMTnet.
Two-Stream AMTnet exhibits superior action detection performance over state-of-the-art approaches on the standard action detection benchmarks.
- Score: 12.581710073789848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose Two-Stream AMTnet, which leverages recent advances
in video-based action representation[1] and incremental action tube
generation[2]. The majority of present action detectors follow a frame-based
representation and late fusion, followed by an offline action tube building
step. These are sub-optimal because: frame-based features barely encode
temporal relations; late fusion prevents the network from learning robust
spatiotemporal features; and, finally, offline action tube generation is not
suitable for many real-world problems, such as autonomous driving and
human-robot interaction, to name a few. The key contributions of this work are: (1)
combining AMTnet's 3D proposal architecture with an online action tube
generation technique which allows the model to learn stronger temporal features
needed for accurate action detection and facilitates running inference online;
(2) an efficient fusion technique allowing the deep network to learn strong
spatiotemporal action representations. This is achieved by augmenting the
previous Action Micro-Tube (AMTnet) action detection framework in three
distinct ways: (1) by adding a parallel motion stream to the original
appearance one in AMTnet; (2) in opposition to state-of-the-art action
detectors which train appearance and motion streams separately and use a
test-time late fusion scheme to fuse RGB and flow cues, by jointly training
both streams in an end-to-end fashion and merging RGB and optical flow features
at training time; (3) by introducing an online action tube generation algorithm
which works at video-level, and in real-time (when exploiting only appearance
features). Two-Stream AMTnet exhibits superior action detection performance
over state-of-the-art approaches on the standard action detection benchmarks.
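To make the joint-training idea concrete, here is a minimal, hypothetical sketch (not the authors' code; the class and parameter names are illustrative): two backbone streams process RGB and optical flow in parallel, their feature maps are concatenated before a shared head, and a single loss then trains both streams end-to-end instead of fusing scores at test time.

```python
# Minimal sketch of training-time RGB/flow fusion. The tiny conv "backbones"
# are stand-ins; the point is that one detection head sees a joint feature
# map, so gradients reach both streams and they are trained jointly.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, feat_dim=256, num_classes=24):
        super().__init__()
        self.rgb_stream = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
        self.flow_stream = nn.Sequential(nn.Conv2d(2, feat_dim, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)   # fuse by concat + 1x1 conv
        self.head = nn.Conv2d(feat_dim, num_classes, 1)

    def forward(self, rgb, flow):
        f = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)
        return self.head(self.fuse(f))   # per-location class scores

model = TwoStreamFusion()
rgb = torch.randn(2, 3, 112, 112)    # RGB frames
flow = torch.randn(2, 2, 112, 112)   # x/y optical-flow fields
scores = model(rgb, flow)            # one loss here updates both streams
```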
Related papers
- STMixer: A One-Stage Sparse Action Detector [43.62159663367588]
We propose two core designs for a more flexible one-stage action detector.
First, we propose a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of features from the entire spatiotemporal domain of the video.
Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes features along the spatial and temporal dimensions respectively for better feature decoding.
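One plausible reading of "decoupled" mixing, sketched below in the MLP-Mixer spirit (the `DecoupledMixer` class and its shapes are assumptions, not STMixer's implementation): one branch mixes sampled features along the spatial axis, the other along the temporal axis, and a residual sum combines them.

```python
import torch
import torch.nn as nn

class DecoupledMixer(nn.Module):
    def __init__(self, num_t=8, num_s=16, dim=64):
        super().__init__()
        self.spatial_mix = nn.Linear(num_s, num_s)    # mixes across spatial samples
        self.temporal_mix = nn.Linear(num_t, num_t)   # mixes across frames
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                             # x: (batch, T, S, dim)
        h = self.norm(x)
        s = self.spatial_mix(h.transpose(2, 3)).transpose(2, 3)           # mix over S
        t = self.temporal_mix(h.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix over T
        return x + s + t                              # residual sum of both branches

x = torch.randn(2, 8, 16, 64)
print(DecoupledMixer()(x).shape)   # torch.Size([2, 8, 16, 64])
```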
arXiv Detail & Related papers (2024-04-15T14:52:02Z) - Action Recognition with Multi-stream Motion Modeling and Mutual
Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
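Channel attention over multi-stream skeleton features is commonly realized in the squeeze-and-excitation style; the sketch below is a generic version under that assumption, not Stream-GCN's code (feature layout `(batch, channels, time, joints)` is illustrative).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                     # x: (B, C, T, V)
        w = self.gate(x.mean(dim=(2, 3)))     # squeeze over time and joints
        return x * w[:, :, None, None]        # re-weight each channel

feats = torch.randn(4, 64, 30, 25)   # e.g. 30 frames, 25 skeleton joints
print(ChannelAttention(64)(feats).shape)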
arXiv Detail & Related papers (2023-06-13T06:56:09Z) - Learning to Refactor Action and Co-occurrence Features for Temporal
Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
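Query-based temporal action detection of this kind is usually DETR-like: learned action queries are decoded against snippet features into (class, segment) pairs. The toy model below illustrates that general pattern only; it is not TadTR's architecture, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class TinyTAD(nn.Module):
    def __init__(self, dim=128, num_queries=10, num_classes=20):
        super().__init__()
        self.transformer = nn.Transformer(d_model=dim, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no action"
        self.seg_head = nn.Linear(dim, 2)                # (center, width) in [0,1]

    def forward(self, snippets):                         # snippets: (B, T, dim)
        q = self.queries.expand(snippets.size(0), -1, -1)
        h = self.transformer(snippets, q)                # decode queries vs. video
        return self.cls_head(h), self.seg_head(h).sigmoid()

feats = torch.randn(2, 100, 128)    # 100 snippet features per video
cls, seg = TinyTAD()(feats)         # per-query class logits and segments
```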
arXiv Detail & Related papers (2021-06-18T17:58:34Z) - Spatiotemporal Deformable Models for Long-Term Complex Activity
Detection [23.880673582575856]
Long-term complex activity recognition can be crucial for autonomous systems such as cars and surgical robots.
Most current methods are designed to merely localise short-term actions/activities or combinations of actions that only last for a few frames or seconds.
Our framework consists of three main building blocks: (i) action detection, (ii) the modelling of the deformable geometry of parts, and (iii) a sparsity mechanism.
arXiv Detail & Related papers (2021-04-16T16:05:34Z) - Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
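As a rough illustration of building local temporal context from adjacent-feature differences, the sketch below derives edge weights between neighbouring snippets from their feature difference and aggregates one hop of context; ATAG's actual graph construction is more involved, and everything here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalGCN(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.edge = nn.Linear(dim, 1)   # scores an edge from a feature difference
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):               # x: (B, T, dim) snippet features
        diff = x[:, 1:] - x[:, :-1]                      # adjacent-feature differences
        w = torch.sigmoid(self.edge(diff))               # (B, T-1, 1) adaptive weights
        pad = torch.zeros_like(x[:, :1])
        left = torch.cat([pad, w * x[:, :-1]], dim=1)    # message from snippet t-1
        right = torch.cat([w * x[:, 1:], pad], dim=1)    # message from snippet t+1
        return F.relu(self.proj(x + left + right))

x = torch.randn(2, 50, 64)
print(AdaptiveTemporalGCN()(x).shape)   # torch.Size([2, 50, 64])
```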
arXiv Detail & Related papers (2021-03-30T02:01:03Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
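A simple way to realize the SME idea of point-to-point similarity between adjacent feature maps is sketched below: locations whose features change strongly between frames get up-weighted. The gating function and shapes are assumptions for illustration, not the paper's module.

```python
import torch
import torch.nn.functional as F

def spatial_motion_enhance(feats):
    """feats: (B, T, C, H, W) per-frame feature maps."""
    cur, nxt = feats[:, :-1], feats[:, 1:]
    sim = F.cosine_similarity(cur, nxt, dim=2)       # (B, T-1, H, W) per-location
    mask = torch.sigmoid(1.0 - sim)                  # low similarity => motion
    mask = torch.cat([mask, mask[:, -1:]], dim=1)    # reuse last map for frame T
    return feats * mask.unsqueeze(2)                 # emphasize moving regions

x = torch.randn(2, 8, 64, 14, 14)
print(spatial_motion_enhance(x).shape)
```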
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Efficient Two-Stream Network for Violence Detection Using Separable
Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model outperforms the previous best reported accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
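The gate substitution this abstract describes is easy to picture in a simplified cell, sketched below: each ConvLSTM gate convolution becomes a depthwise convolution followed by a 1x1 pointwise convolution. This is a generic sketch of the substitution, not the paper's implementation.

```python
import torch
import torch.nn as nn

def sep_conv(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin),  # depthwise
        nn.Conv2d(cin, cout, 1))                             # pointwise

class SepConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        # One separable conv produces all four gates from [input, hidden].
        self.gates = sep_conv(in_ch + hid_ch, 4 * hid_ch)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

cell = SepConvLSTMCell(32, 64)
x = torch.randn(1, 32, 56, 56)
h = c = torch.zeros(1, 64, 56, 56)
out, _ = cell(x, (h, c))   # run once per frame of the video
```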
arXiv Detail & Related papers (2021-02-21T12:01:48Z) - Online Spatiotemporal Action Detection and Prediction via Causal
Representations [1.9798034349981157]
We start with the conversion of the traditional offline action detection pipeline into an online action tube detection system.
We explore the future prediction capabilities of such detection methods by extending an existing action tube into the future by regression.
Later, we seek to establish that online/causal representations can achieve similar performance to that of offline three-dimensional (3D) convolutional neural networks (CNNs) on various tasks.
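One minimal form of regression-based tube extension, offered only as an illustration of the idea (the paper's regressor is learned): fit a line to each box coordinate over the recent past and extrapolate it into the future.

```python
import numpy as np

def extend_tube(boxes, n_future):
    """boxes: (T, 4) array of [x1, y1, x2, y2], one per past frame."""
    t = np.arange(len(boxes))
    future_t = np.arange(len(boxes), len(boxes) + n_future)
    preds = []
    for k in range(4):                                  # one fit per coordinate
        slope, intercept = np.polyfit(t, boxes[:, k], deg=1)
        preds.append(slope * future_t + intercept)
    return np.stack(preds, axis=1)                      # (n_future, 4)

past = np.array([[10, 10, 50, 80], [12, 11, 52, 81], [14, 12, 54, 82]], float)
print(extend_tube(past, 2))   # boxes predicted for the next two frames
```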
arXiv Detail & Related papers (2020-08-31T17:28:51Z) - WOAD: Weakly Supervised Online Action Detection in Untrimmed Videos [124.72839555467944]
We propose a weakly supervised framework that can be trained using only video-class labels.
We show that our method largely outperforms weakly-supervised baselines.
When strongly supervised, our method obtains the state-of-the-art results in the tasks of both online per-frame action recognition and online detection of action start.
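Training from video-class labels alone is typically cast as multiple-instance learning: per-frame scores are pooled over time and supervised with the video label. The loss below is a generic sketch of that pattern, not WOAD's actual objective.

```python
import torch
import torch.nn.functional as F

def video_level_loss(frame_logits, video_labels, k=8):
    # frame_logits: (B, T, C) per-frame class scores
    # video_labels: (B, C) multi-hot video-level labels
    topk = frame_logits.topk(k, dim=1).values.mean(dim=1)  # top-k temporal pooling
    return F.binary_cross_entropy_with_logits(topk, video_labels)

logits = torch.randn(2, 100, 20)
labels = torch.zeros(2, 20)
labels[0, 3] = labels[1, 7] = 1.0
print(video_level_loss(logits, labels))
```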
arXiv Detail & Related papers (2020-06-05T23:08:41Z) - Gabriella: An Online System for Real-Time Activity Detection in
Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance with state-of-the-art results.
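Online tubelet merging is often implemented greedily; the hedged sketch below (not Gabriella's algorithm) appends each incoming box to the live tube whose last box overlaps it most, and starts a new tube otherwise.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge_online(tubes, new_boxes, thresh=0.3):
    for box in new_boxes:                    # boxes detected in the new frame
        scores = [iou(t[-1], box) for t in tubes]
        if scores and max(scores) >= thresh:
            tubes[scores.index(max(scores))].append(box)   # extend best match
        else:
            tubes.append([box])                            # start a new tube
    return tubes

tubes = merge_online([], [[0, 0, 10, 10]])
tubes = merge_online(tubes, [[1, 1, 11, 11], [50, 50, 60, 60]])
print([len(t) for t in tubes])   # [2, 1]
```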
arXiv Detail & Related papers (2020-04-23T22:20:10Z)