Cross-Modal Learning with 3D Deformable Attention for Action Recognition
- URL: http://arxiv.org/abs/2212.05638v3
- Date: Thu, 17 Aug 2023 07:23:45 GMT
- Title: Cross-Modal Learning with 3D Deformable Attention for Action Recognition
- Authors: Sangwon Kim and Dasom Ahn and Byoung Chul Ko
- Abstract summary: We propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme.
The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction datasets, and showed results better than or similar to pre-trained state-of-the-art methods.
- Score: 4.128256616073278
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: An important challenge in vision-based action recognition is the embedding of
spatiotemporal features with two or more heterogeneous modalities into a single
feature. In this study, we propose a new 3D deformable transformer for action
recognition with adaptive spatiotemporal receptive fields and a cross-modal
learning scheme. The 3D deformable transformer consists of three attention
modules: 3D deformability, local joint stride, and temporal stride attention.
The two cross-modal tokens are input into the 3D deformable attention module to
create a cross-attention token with a reflected spatiotemporal correlation.
Local joint stride attention is applied to spatially combine attention and pose
tokens. Temporal stride attention temporally reduces the number of input tokens
in the attention module and supports temporal expression learning without the
simultaneous use of all tokens. The deformable transformer iterates L-times and
combines the last cross-modal token for classification. The proposed 3D
deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction
datasets, and showed results better than or similar to pre-trained
state-of-the-art methods even without a pre-training process. In addition, by
visualizing important joints and correlations during action recognition through
spatial joint and temporal stride attention, the potential for explainable action recognition is demonstrated.
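The abstract describes the block wiring: two cross-modal token streams enter a 3D deformable attention module, joint-wise attention spatially combines the result with pose tokens, temporal stride attention reduces the number of tokens, and the whole block is iterated L times before classification. The snippet below is a minimal PyTorch sketch of that wiring only, assuming illustrative names and dimensions; nn.MultiheadAttention stands in for the paper's deformable and stride attention modules, and the pooled classification head is an assumption, not the authors' implementation.

```python
# Minimal sketch of the block wiring described in the abstract:
# cross-modal attention, joint-wise attention, and token-reducing temporal
# attention, repeated L times. nn.MultiheadAttention is only a stand-in for
# the paper's 3D deformable and stride attention; names/dims are assumptions.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One iteration: cross-modal, joint-wise, and temporal-stride attention."""

    def __init__(self, dim=256, heads=8, temporal_stride=2):
        super().__init__()
        # Stand-in for 3D deformable attention between the two modalities.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in for local joint stride attention (spatial combination with pose tokens).
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in for temporal stride attention (strided queries shrink the sequence).
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.temporal_stride = temporal_stride

    def forward(self, vis_tokens, pose_tokens):
        # Cross-modal attention: visual tokens query the pose tokens.
        cross, _ = self.cross_attn(vis_tokens, pose_tokens, pose_tokens)
        x = self.norm1(vis_tokens + cross)

        # Joint-wise attention: spatially combine the cross-attention and pose tokens.
        joint, _ = self.joint_attn(x, pose_tokens, pose_tokens)
        x = self.norm2(x + joint)

        # Temporal-stride attention: every stride-th token acts as a query,
        # so the number of tokens passed to the next block is reduced.
        q = x[:, ::self.temporal_stride]
        t, _ = self.temporal_attn(q, x, x)
        x = self.norm3(q + t)
        pose_tokens = pose_tokens[:, ::self.temporal_stride]  # keep streams aligned
        return x, pose_tokens


class CrossModalTransformerSketch(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_classes=60):  # 60 classes, e.g. NTU60
        super().__init__()
        self.blocks = nn.ModuleList([CrossModalBlock(dim) for _ in range(num_layers)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, vis_tokens, pose_tokens):
        for blk in self.blocks:  # iterate L times
            vis_tokens, pose_tokens = blk(vis_tokens, pose_tokens)
        # Pool the remaining cross-modal tokens and classify (pooling is an assumption).
        return self.head(vis_tokens.mean(dim=1))


if __name__ == "__main__":
    # Toy usage: batch of 2 clips, 64 tokens per modality, feature width 256.
    model = CrossModalTransformerSketch()
    vis, pose = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
    print(model(vis, pose).shape)  # torch.Size([2, 60])
```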
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become a popular research topic in computer vision.
We propose a multi-view attention consistency method that computes the similarity between attention maps obtained from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - EmMixformer: Mix transformer for eye movement recognition [43.75206776070943]
We propose a mixed transformer termed EmMixformer to extract time and frequency domain information for eye movement recognition.
We are the first to attempt leveraging transformers to learn long-term temporal dependencies within eye movement.
As the three modules provide complementary feature representations in terms of local and global dependencies, the proposed EmMixformer is capable of improving recognition accuracy.
arXiv Detail & Related papers (2024-01-10T06:45:37Z) - Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - Points to Patches: Enabling the Use of Self-Attention for 3D Shape
Recognition [19.89482062012177]
We propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which combines local and global attention mechanisms.
Experiments on shape classification show that such an approach provides more useful features for downstream tasks than the baseline Transformer.
We also extend our method to feature matching for scene reconstruction, showing that it can be used in conjunction with existing scene reconstruction pipelines.
arXiv Detail & Related papers (2022-04-08T09:31:24Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z) - Ripple Attention for Visual Perception with Sub-quadratic Complexity [7.425337104538644]
Transformer architectures are now central to sequence modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space.
arXiv Detail & Related papers (2021-10-06T02:00:38Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting temporal features at multiple resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture (a minimal illustrative sketch follows this list).
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
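The Multi-Temporal Convolutions entry above describes lightweight blocks that extract temporal features at multiple resolutions and can be slotted into a 3D-CNN. The sketch below illustrates that general idea only, under stated assumptions: parallel Conv3d branches with different temporal kernel sizes whose outputs are concatenated; the branch kernels, channel split, and normalization are illustrative choices, not the block design from the cited paper.

```python
# Illustrative multi-resolution temporal convolution block: parallel 3D
# convolutions with different temporal kernel sizes, concatenated on channels.
# Kernel sizes and the channel split are assumptions, not the paper's design.
import torch
import torch.nn as nn


class MultiTemporalConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, temporal_kernels=(1, 3, 5)):
        super().__init__()
        branch_channels = out_channels // len(temporal_kernels)
        self.branches = nn.ModuleList(
            nn.Conv3d(
                in_channels,
                branch_channels,
                kernel_size=(k, 3, 3),   # temporal kernel k, spatial 3x3
                padding=(k // 2, 1, 1),  # keep T, H, W unchanged
            )
            for k in temporal_kernels
        )
        self.bn = nn.BatchNorm3d(branch_channels * len(temporal_kernels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.relu(self.bn(out))


if __name__ == "__main__":
    # Toy usage: 8-frame clip at 56x56 resolution.
    block = MultiTemporalConvBlock(in_channels=64, out_channels=96)
    clip = torch.randn(2, 64, 8, 56, 56)
    print(block(clip).shape)  # torch.Size([2, 96, 8, 56, 56])
```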
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.