A Circular Window-based Cascade Transformer for Online Action Detection
- URL: http://arxiv.org/abs/2208.14209v1
- Date: Tue, 30 Aug 2022 12:37:23 GMT
- Title: A Circular Window-based Cascade Transformer for Online Action Detection
- Authors: Shuqiang Cao, Weixin Luo, Bairui Wang, Wei Zhang, Lin Ma
- Abstract summary: We advocate a novel and efficient principle for online action detection.
It merely updates the latest and oldest historical representations in one window but reuses the intermediate ones, which have already been computed.
Based on this principle, we introduce a window-based cascade Transformer with a circular historical queue, which conducts multi-stage attention and cascade refinement on each window.
- Score: 27.880350187125778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online action detection aims at the accurate action prediction of the current
frame based on long historical observations. Meanwhile, it demands real-time
inference on online streaming videos. In this paper, we advocate a novel and
efficient principle for online action detection. It merely updates the latest
and oldest historical representations in one window but reuses the intermediate
ones, which have already been computed. Based on this principle, we introduce a
window-based cascade Transformer with a circular historical queue, which
conducts multi-stage attention and cascade refinement on each window. We also
explore the association between online action detection and its counterpart
offline action segmentation as an auxiliary task. We find that such extra
supervision aids discriminative history clustering and acts as feature
augmentation for better training of the classifier and cascade refinement. Our
proposed method achieves state-of-the-art performance on three challenging
datasets: THUMOS'14, TVSeries, and HDD. Code will be available after
acceptance.
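The circular-queue principle above can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' released code: the class name `CircularFeatureQueue` and its interface are assumptions. The point it demonstrates is that each new frame only adds the newest representation and evicts the oldest, while intermediate representations stay cached and are reused rather than recomputed per window.

```python
from collections import deque

import numpy as np


class CircularFeatureQueue:
    """Fixed-length history of per-frame features (illustrative sketch).

    When a new frame arrives, only the newest entry is added and the
    oldest is evicted; all intermediate representations remain cached
    and are reused instead of being recomputed for every window.
    """

    def __init__(self, window_size: int, feat_dim: int):
        # maxlen makes the deque auto-evict the oldest entry on append.
        self.queue = deque(maxlen=window_size)
        self.feat_dim = feat_dim

    def push(self, frame_feat: np.ndarray) -> np.ndarray:
        # O(1) maintenance per frame: append the latest representation;
        # the deque drops the oldest once full, touching no other entry.
        assert frame_feat.shape == (self.feat_dim,)
        self.queue.append(frame_feat)
        # Current window, stacked for attention over the history.
        return np.stack(self.queue)


# Usage: stream 6 frames through a window of length 4.
q = CircularFeatureQueue(window_size=4, feat_dim=8)
for t in range(6):
    window = q.push(np.full(8, float(t)))
print(window.shape)  # (4, 8): frames 2..5 remain after eviction
```

The cached intermediates are what a multi-stage attention module would consume; the sketch only models the queue maintenance, not the Transformer itself.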
Related papers
- Online Action Representation using Change Detection and Symbolic Programming [0.3937354192623676]
The proposed method employs a change detection algorithm to automatically segment action sequences.
We show the effectiveness of this representation in the downstream task of class repetition detection.
The results of the experiments demonstrate that, despite operating online, the proposed method performs on par with or better than the existing method.
arXiv Detail & Related papers (2024-05-19T10:31:59Z) - SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z) - Continual Transformers: Redundancy-Free Attention for Online Inference [86.3361797111839]
We propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference in a continual input stream.
Our modification is purely to the order of computations, while the produced outputs and learned weights are identical to those of the original Multi-Head Attention.
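A minimal sketch of what token-by-token online attention with a sliding key/value cache looks like, to make the blurb above concrete. This is an assumption-laden illustration, not the Continual Transformers formulation: the function `online_attention_step` and its cache layout are hypothetical, and it simply computes attention for the newest query over a bounded window of cached keys and values rather than re-running attention over the full window.

```python
import numpy as np


def online_attention_step(kv_cache, k_new, v_new, q_new, window):
    """One token-by-token step of sliding-window attention (sketch).

    Keep a key/value cache for the last `window` tokens, append the
    new pair, drop the oldest, and compute scaled dot-product
    attention only for the newest query.
    """
    kv_cache.append((k_new, v_new))
    if len(kv_cache) > window:
        kv_cache.pop(0)  # oldest token leaves the window
    K = np.stack([k for k, _ in kv_cache])
    V = np.stack([v for _, v in kv_cache])
    scores = K @ q_new / np.sqrt(q_new.size)   # scaled dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the window
    return weights @ V                         # attention output for q_new


# Usage: stream 5 tokens through a window of length 3.
cache = []
for t in range(5):
    feat = np.full(4, float(t))
    out = online_attention_step(cache, feat, feat, np.ones(4), window=3)
```

Note that this recomputes the softmax over the whole window each step; the cited paper's contribution is reorganizing the computation so outputs match full Multi-Head Attention while avoiding that redundancy.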
arXiv Detail & Related papers (2022-01-17T08:20:09Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z) - Online Spatiotemporal Action Detection and Prediction via Causal Representations [1.9798034349981157]
We start with the conversion of the traditional offline action detection pipeline into an online action tube detection system.
We explore the future prediction capabilities of such detection methods by extending an existing action tube into the future by regression.
Later, we seek to establish that online/causal representations can achieve performance similar to that of offline three-dimensional (3D) convolutional neural networks (CNNs) on various tasks.
arXiv Detail & Related papers (2020-08-31T17:28:51Z) - WOAD: Weakly Supervised Online Action Detection in Untrimmed Videos [124.72839555467944]
We propose a weakly supervised framework that can be trained using only video-class labels.
We show that our method largely outperforms weakly-supervised baselines.
When strongly supervised, our method obtains the state-of-the-art results in the tasks of both online per-frame action recognition and online detection of action start.
arXiv Detail & Related papers (2020-06-05T23:08:41Z) - Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance with state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z) - Two-Stream AMTnet for Action Detection [12.581710073789848]
We propose a new deep neural network architecture for online action detection, termed Two-Stream AMTnet, which adds a motion stream to the original appearance one in AMTnet.
Two-Stream AMTnet exhibits superior action detection performance over state-of-the-art approaches on the standard action detection benchmarks.
arXiv Detail & Related papers (2020-04-03T12:16:45Z) - A Novel Online Action Detection Framework from Untrimmed Video Streams [19.895434487276578]
We propose a novel online action detection framework that considers actions as a set of temporally ordered subclasses.
We augment our data by varying the lengths of videos to allow the proposed method to learn about the high intra-class variation in human actions.
arXiv Detail & Related papers (2020-03-17T14:11:24Z) - Dynamic Inference: A New Approach Toward Efficient Video Action Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.