Learning Scene Flow With Skeleton Guidance For 3D Action Recognition
- URL: http://arxiv.org/abs/2306.13285v1
- Date: Fri, 23 Jun 2023 04:14:25 GMT
- Title: Learning Scene Flow With Skeleton Guidance For 3D Action Recognition
- Authors: Vasileios Magoulianitis, Athanasios Psaltis
- Abstract summary: This work demonstrates the use of 3D flow sequences by a deep spatiotemporal model for 3D action recognition.
An extended deep skeleton model is also introduced to learn the most discriminant action motion dynamics.
A late fusion scheme is adopted between the two models for learning high-level cross-modal correlations.
- Score: 1.5954459915735735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Among the existing modalities for 3D action recognition, 3D flow has been
poorly examined, although it conveys rich motion cues for human actions.
Presumably, its susceptibility to noise renders it intractable, challenging
the learning process within deep models. This work demonstrates the use of 3D
flow sequences by a deep spatiotemporal model and further proposes an
incremental two-level spatial attention mechanism, guided by the skeleton
domain, that emphasizes motion features close to body joint areas according to
their informativeness. Towards this end, an extended deep skeleton model is
also introduced to learn the most discriminant action motion dynamics and
thereby estimate an informativeness score for each joint. Subsequently, a late
fusion scheme is adopted between the two models to learn high-level
cross-modal correlations. Experimental results on NTU RGB+D, currently the
largest and most challenging dataset, demonstrate the effectiveness of the
proposed approach, achieving state-of-the-art results.
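The two-level attention and late fusion described above can be made concrete with a short sketch. The PyTorch code below is a hypothetical reconstruction, not the authors' implementation: it assumes flow features on a spatial grid, normalized 2D joint locations, and per-joint informativeness scores from the skeleton model; all names, shapes, and the Gaussian-mask choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def joint_guided_attention(flow_feats, joint_xy, joint_scores, sigma=0.1):
    """Two-level spatial attention over 3D-flow features (illustrative sketch).

    flow_feats:   (B, C, H, W) spatio-temporal flow feature maps
    joint_xy:     (B, J, 2) joint locations, normalized to [0, 1]
    joint_scores: (B, J) per-joint informativeness from the skeleton model
    """
    B, C, H, W = flow_feats.shape
    ys = torch.linspace(0, 1, H, device=flow_feats.device)
    xs = torch.linspace(0, 1, W, device=flow_feats.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).view(1, 1, H, W, 2)

    # Level 1: a Gaussian mask per joint emphasizes motion near that joint.
    d2 = ((grid - joint_xy.view(B, -1, 1, 1, 2)) ** 2).sum(-1)   # (B, J, H, W)
    masks = torch.exp(-d2 / (2 * sigma ** 2))

    # Level 2: weight each joint's mask by its informativeness score.
    w = F.softmax(joint_scores, dim=1).view(B, -1, 1, 1)
    attn = (w * masks).sum(dim=1, keepdim=True)                  # (B, 1, H, W)
    attn = attn / attn.amax(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    return flow_feats * attn

def late_fusion(flow_logits, skel_logits, alpha=0.5):
    """Score-level late fusion of the two streams (one common variant)."""
    return alpha * flow_logits + (1 - alpha) * skel_logits
```

The paper's actual mechanism may be learned rather than hand-crafted; the sketch only conveys the structure implied by the abstract: joint-localized masks scaled by informativeness, followed by score-level fusion of the two models.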
Related papers
- 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances the alignment of knowledge distilled from camera views.
arXiv Detail & Related papers (2024-07-08T17:59:54Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose Reconstruction in a Diffusion Framework [6.669850111205944]
Monocular 3D human pose estimation poses significant challenges due to inherent depth ambiguities that arise during the reprojection process from 2D to 3D.
Recent advancements in diffusion models have shown promise in incorporating structural priors to address reprojection ambiguities.
We propose a novel cross-channel embedding framework that aims to fully explore the correlation between joint-level features of 3D coordinates and their 2D projections.
arXiv Detail & Related papers (2024-01-18T09:53:03Z)
- Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior.
Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton and its per-frame deformation (see the sketch after this list).
A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from a sequence of 2D observations.
arXiv Detail & Related papers (2023-08-18T16:41:57Z)
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that obtains accurate and smooth full-body motions from only three tracking signals: the head and the two hands.
Our framework explicitly models joint-level features in the first stage and uses them as spatio-temporal tokens in alternating spatial and temporal transformer blocks that capture joint-level correlations in the second stage (see the sketch after this list).
With extensive experiments on the AMASS motion dataset and real-captured data, we show that our method achieves more accurate and smoother motion than existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object detection and classification models, which take image or patch features as input, LocATe's transformer captures long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- Dynamical Deep Generative Latent Modeling of 3D Skeletal Motion [15.359134407309726]
Our model decomposes highly correlated skeleton data into a small set of spatial bases with switching temporal processes.
This results in a dynamical deep generative latent model that parses the meaningful intrinsic states in the dynamics of 3D pose data.
arXiv Detail & Related papers (2021-06-18T23:58:49Z)
- Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions, together with a unified spatial-temporal graph convolutional operator named G3D (see the sketch after this list).
By coupling these proposals, we develop a powerful feature extractor named MS-G3D, based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
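For the non-rigid structure-from-motion entry above, the core decomposition is simple enough to state in code: each frame's 3D skeleton is a shared reference skeleton plus a per-frame deformation, fit so that its camera projection matches the 2D observations. A minimal sketch under an orthographic-camera assumption (all names hypothetical):

```python
import torch

def reconstruct_frames(reference, deformations):
    """3D pose per frame = shared reference skeleton + per-frame deformation.

    reference:    (J, 3) shared 3D reference skeleton
    deformations: (T, J, 3) per-frame deformation offsets
    returns:      (T, J, 3) reconstructed 3D skeletons
    """
    return reference.unsqueeze(0) + deformations

def reprojection_loss(reference, deformations, obs_2d):
    """Fit the decomposition by matching projections to 2D observations."""
    pred_3d = reconstruct_frames(reference, deformations)
    pred_2d = pred_3d[..., :2]    # orthographic camera, for simplicity
    return ((pred_2d - obs_2d) ** 2).mean()
```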
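The joint-level modeling entry describes alternating spatial and temporal transformer blocks over joint tokens. One standard way to realize that pattern, sketched below and not taken from the paper's code, is to reshape a (batch, time, joints, channels) tensor so self-attention runs across joints in one pass and across time in the next:

```python
import torch
import torch.nn as nn

class AlternatingSTBlock(nn.Module):
    """One spatial + one temporal self-attention pass over joint tokens (sketch)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (B, T, J, C)
        B, T, J, C = x.shape
        # Spatial pass: attention across the J joints of each frame.
        x = self.spatial(x.reshape(B * T, J, C)).reshape(B, T, J, C)
        # Temporal pass: attention across the T frames of each joint.
        x = x.transpose(1, 2).reshape(B * J, T, C)
        x = self.temporal(x).reshape(B, J, T, C).transpose(1, 2)
        return x
```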
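The disentangled multi-scale graph convolution in the MS-G3D entry replaces adjacency powers, which are dominated by close neighbors, with k-adjacency matrices that connect only nodes at graph distance exactly k, plus self-loops. A small NumPy sketch of that construction:

```python
import numpy as np

def k_adjacency(A, k):
    """Disentangled k-adjacency: connect nodes at graph distance exactly k.

    A: (V, V) binary 1-hop adjacency of the skeleton graph (no self-loops).
    Returns 1 where the shortest-path distance equals k, plus self-loops.
    """
    V = A.shape[0]
    I = np.eye(V)
    if k == 0:
        return I
    # Reachable within k hops minus reachable within k-1 hops
    # isolates exactly the distance-k pairs.
    Ak = np.minimum(np.linalg.matrix_power(A + I, k), 1)
    Ak_minus = np.minimum(np.linalg.matrix_power(A + I, k - 1), 1)
    return Ak - Ak_minus + I
```

A multi-scale layer then aggregates features with each normalized k-adjacency matrix and sums or concatenates the results across scales.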
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.