A Dual-Masked Auto-Encoder for Robust Motion Capture with
Spatial-Temporal Skeletal Token Completion
- URL: http://arxiv.org/abs/2207.07381v1
- Date: Fri, 15 Jul 2022 10:00:43 GMT
- Title: A Dual-Masked Auto-Encoder for Robust Motion Capture with
Spatial-Temporal Skeletal Token Completion
- Authors: Junkun Jiang, Jie Chen, Yike Guo
- Abstract summary: We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity.
We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion.
In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
- Score: 13.88656793940129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-person motion capture can be challenging due to ambiguities caused by
severe occlusion, fast body movement, and complex interactions. Existing
frameworks build on 2D pose estimation and triangulate to 3D coordinates by
reasoning about the appearance, trajectory, and geometric consistencies among
multi-camera observations. However, 2D joint detection is often incomplete
and carries wrong identity assignments due to limited observation angles, which
leads to noisy 3D triangulation results. To overcome this issue, we propose to
exploit the short-range autoregressive characteristics of skeletal motion using
a transformer. First, we propose an adaptive, identity-aware triangulation module
to reconstruct 3D joints and identify the missing joints for each identity. To
generate complete 3D skeletal motion, we then propose a Dual-Masked
Auto-Encoder (D-MAE) which encodes the joint status with both
skeletal-structural and temporal position encoding for trajectory completion.
D-MAE's flexible masking and encoding mechanisms enable arbitrary skeleton
definitions to be conveniently deployed under the same framework. In order to
demonstrate the proposed model's capability in dealing with severe data loss
scenarios, we contribute a high-accuracy and challenging motion capture dataset
of multi-person interactions with severe occlusion. Evaluations on both
the benchmark and our new dataset demonstrate the efficiency of our proposed
model, as well as its advantage over other state-of-the-art methods.
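The dual position-encoding idea above (a skeletal-structural code over joint indices plus a temporal code over frame indices, with masked-out joints replaced by a placeholder token) can be illustrated with a minimal sketch. The sinusoidal encodings, the toy linear projection `W`, the embedding size `dim`, and the zero placeholder standing in for a learnable mask token are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Standard sinusoidal position encoding over a 1-D index set."""
    pos = np.asarray(positions, dtype=np.float64)[:, None]   # (N, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]       # (1, dim/2)
    angles = pos / (10000.0 ** (2.0 * i / dim))
    enc = np.zeros((pos.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def dual_encoded_tokens(joints_3d, visible, dim=16):
    """
    joints_3d: (T, J, 3) per-frame joint positions; visible: (T, J) bool mask.
    Each joint observation becomes one token: a linear projection of its 3D
    coordinates plus a temporal encoding (frame index) and a skeletal-structural
    encoding (joint index). Missing joints get a zero placeholder here, standing
    in for a learnable mask token.
    """
    T, J, _ = joints_3d.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.02, size=(3, dim))                # toy projection
    tokens = joints_3d.reshape(T * J, 3) @ W
    tokens[~visible.reshape(T * J)] = 0.0                    # mask-token stand-in
    # Token order is frame-major (t * J + j), so repeat the temporal code per
    # joint and tile the skeletal code per frame.
    temporal = np.repeat(sinusoidal_encoding(np.arange(T), dim), J, axis=0)
    skeletal = np.tile(sinusoidal_encoding(np.arange(J), dim), (T, 1))
    return tokens + temporal + skeletal
```

A masked joint's token then carries only its two positional codes, which is what lets the decoder infer the missing trajectory point from its spatial and temporal neighbors.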
Related papers
- Occlusion-Aware 3D Motion Interpretation for Abnormal Behavior Detection [10.782354892545651]
We present OAD2D, which detects motion abnormalities by reconstructing the 3D coordinates of mesh vertices and human joints from monocular videos.
We reformulate abnormal posture estimation by coupling it with a Motion-to-Text (M2T) model, in which a VQVAE is employed to quantize motion features.
Our approach demonstrates the robustness of abnormal behavior detection against severe and self-occlusions, as it reconstructs human motion trajectories in global coordinates.
arXiv Detail & Related papers (2024-07-23T18:41:16Z)
- 3D Face Modeling via Weakly-supervised Disentanglement Network joint Identity-consistency Prior [62.80458034704989]
Generative 3D face models featuring disentangled controlling factors hold immense potential for diverse applications in computer vision and computer graphics.
Previous 3D face modeling methods face a challenge as they demand specific labels to effectively disentangle these factors.
This paper introduces a Weakly-Supervised Disentanglement Framework, denoted as WSDF, to facilitate the training of controllable 3D face models without an overly stringent labeling requirement.
arXiv Detail & Related papers (2024-04-25T11:50:47Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial to enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three public benchmark datasets: Human3.6M, CMU Panoptic, and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z)
- RSB-Pose: Robust Short-Baseline Binocular 3D Human Pose Estimation with Occlusion Handling [19.747618899243555]
We set our sights on a short-baseline binocular setting that offers both portability and a geometric measurement property.
As the binocular baseline shortens, two serious challenges emerge: first, the robustness of 3D reconstruction against 2D errors deteriorates.
We propose the Stereo Co-Keypoints Estimation module to improve the view consistency of 2D keypoints and enhance the 3D robustness.
arXiv Detail & Related papers (2023-11-24T01:15:57Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of 2D features projected along rays into 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance of the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to the Normalized Device Coordinates space.
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only.
Our framework explicitly models the joint-level features in the first stage and utilizes them as temporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage.
With extensive experiments on the AMASS motion dataset and real-captured data, we show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed as Homography Loss, is proposed to achieve the goal, which exploits both 2D and 3D information.
Our method yields the best performance, outperforming other state-of-the-art methods by a large margin on the KITTI 3D dataset.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.