DirecFormer: A Directed Attention in Transformer Approach to Robust
Action Recognition
- URL: http://arxiv.org/abs/2203.10233v1
- Date: Sat, 19 Mar 2022 03:41:48 GMT
- Title: DirecFormer: A Directed Attention in Transformer Approach to Robust
Action Recognition
- Authors: Thanh-Dat Truong, Quoc-Huy Bui, Chi Nhan Duong, Han-Seok Seo, Son Lam
Phung, Xin Li, Khoa Luu
- Abstract summary: This work presents a novel end-to-end Transformer-based Directed Attention framework for robust action recognition.
The contributions of this work are three-fold. Firstly, we introduce the problem of ordered temporal learning to action recognition.
Secondly, a new Directed Attention mechanism is introduced to understand and attend to human actions in the correct order.
- Score: 22.649489578944838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition has recently become one of the popular research
topics in the computer vision community. Various 3D-CNN based methods have been
presented to tackle both the spatial and temporal dimensions in the task of
video action recognition with competitive results. However, these methods have
suffered from fundamental limitations such as a lack of robustness and
generalization, e.g., how does the temporal ordering of video frames affect the
recognition results? This work presents a novel end-to-end Transformer-based
Directed Attention (DirecFormer) framework for robust action recognition. The
method takes a simple but novel Transformer-based perspective to understand the
correct order of action sequences. The contributions of this work are
three-fold. Firstly, we introduce the problem of ordered temporal learning to
action recognition. Secondly, a new Directed Attention mechanism is introduced
to understand and attend to human actions in the correct order. Thirdly, we
introduce conditional dependency into action sequence modeling that includes
both orders and classes. The proposed approach consistently achieves
state-of-the-art (SOTA) results compared with recent action recognition methods
on three standard large-scale benchmarks, i.e., Jester, Kinetics-400, and
Something-Something-V2.
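As a rough illustration of the directed attention idea, the sketch below implements an order-aware self-attention layer over per-frame tokens in PyTorch, where a learnable pairwise bias steers attention along the temporal direction. The class name, the bias parameterization, and the tensor shapes are illustrative assumptions made for this summary, not the paper's released DirecFormer implementation.

```python
# Hypothetical sketch of order-aware ("directed") self-attention over frame
# tokens. The directional bias term is an assumption for illustration only.
import torch
import torch.nn as nn

class DirectedFrameAttention(nn.Module):
    def __init__(self, dim: int, num_frames: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable pairwise bias encoding a preferred temporal direction
        # between every pair of frames (illustrative parameterization).
        self.direction_bias = nn.Parameter(torch.zeros(num_frames, num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) -- one token per frame
        b, t, d = x.shape
        qkv = self.qkv(x).reshape(b, t, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (b, heads, t, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn + self.direction_bias      # bias attention toward the learned frame order
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

# Usage: 2 clips, 16 frame tokens of dimension 512 each
frame_tokens = torch.randn(2, 16, 512)
out = DirectedFrameAttention(dim=512, num_frames=16)(frame_tokens)  # (2, 16, 512)
```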
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action
Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation.
We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation.
Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z) - ActAR: Actor-Driven Pose Embeddings for Video Action Recognition [12.043574473965318]
Human action recognition (HAR) in videos is one of the core tasks of video understanding.
We propose a new method that simultaneously learns to efficiently recognize human actions in the infrared spectrum.
arXiv Detail & Related papers (2022-04-19T05:12:24Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z) - Zero-Shot Action Recognition with Transformer-based Video Semantic
Embedding [36.24563211765782]
We take a new comprehensive look at the inductive zero-shot action recognition problem from a realistic standpoint.
Specifically, we advocate for a concrete formulation for zero-shot action recognition that avoids an exact overlap between the training and testing classes.
We propose a novel end-to-end trained transformer model which is capable of capturing long-range temporal dependencies efficiently.
arXiv Detail & Related papers (2022-03-10T05:03:58Z) - Temporal Shuffling for Defending Deep Action Recognition Models against
Adversarial Attacks [67.58887471137436]
We develop a novel defense method that uses temporal shuffling of input videos to protect action recognition models against adversarial attacks (a minimal sketch of the shuffling operation appears after this list).
To the best of our knowledge, this is the first attempt to design a defense method that requires no additional training for 3D CNN-based video action recognition models.
arXiv Detail & Related papers (2021-12-15T06:57:01Z) - Dynamic Inference: A New Approach Toward Efficient Video Action
Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z) - Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize the action with a partial observation.
We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification.
We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)
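For the temporal-shuffling defense listed above, the core operation reduces to permuting an input clip along its time axis at inference. The helper below is a minimal sketch under an assumed (channels, frames, height, width) layout; the function name and layout are hypothetical and not the authors' released code.

```python
# Minimal sketch of temporal shuffling of an input clip (illustrative only).
import torch

def temporally_shuffle(clip: torch.Tensor) -> torch.Tensor:
    """Randomly permute a clip along its temporal (frame) axis."""
    # clip: (channels, num_frames, height, width) -- a common 3D-CNN input layout.
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

shuffled = temporally_shuffle(torch.randn(3, 16, 224, 224))  # 16 RGB frames of 224x224
```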
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.