Every Angle Is Worth A Second Glance: Mining Kinematic Skeletal Structures from Multi-view Joint Cloud
- URL: http://arxiv.org/abs/2502.02936v1
- Date: Wed, 05 Feb 2025 07:02:28 GMT
- Title: Every Angle Is Worth A Second Glance: Mining Kinematic Skeletal Structures from Multi-view Joint Cloud
- Authors: Junkun Jiang, Jie Chen, Ho Yin Au, Mingyuan Chen, Wei Xue, Yike Guo
- Abstract summary: Multi-person motion capture over sparse angular observations is a challenging problem under interference from both self- and mutual-occlusions.
We propose to triangulate between all same-typed 2D joints from all camera views regardless of their target ID, forming the Joint Cloud.
The Joint Cloud consists of valid joints lifted from the same joint type and target ID, as well as falsely constructed ones from different 2D sources.
- Score: 19.511737728909562
- License:
- Abstract: Multi-person motion capture over sparse angular observations is a challenging problem under interference from both self- and mutual-occlusions. Existing works produce accurate 2D joint detections; however, when these are triangulated and lifted into 3D, available solutions struggle to select the most accurate candidates and to associate them with the correct joint type and target identity. To fully utilize all accurate 2D joint location information, we propose to independently triangulate between all same-typed 2D joints from all camera views, regardless of their target ID, forming the Joint Cloud. The Joint Cloud consists of valid joints lifted from the same joint type and target ID, as well as falsely constructed ones from different 2D sources. These redundant and inaccurate candidates are processed by the proposed Joint Cloud Selection and Aggregation Transformer (JCSAT), whose three cascaded encoders deeply explore the trajectile, skeletal-structural, and view-dependent correlations among all 3D point candidates in the cross-embedding space. An Optimal Token Attention Path (OTAP) module then selects and aggregates informative features from these redundant observations for the final prediction of human motion. To demonstrate the effectiveness of JCSAT, we build and publish BUMocap-X, a new multi-person motion capture dataset with complex interactions and severe occlusions. Comprehensive experiments on the new dataset as well as on benchmark datasets validate the effectiveness of the proposed framework, which outperforms all existing state-of-the-art methods, especially under challenging occlusion scenarios.
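To make the lifting step concrete, below is a minimal sketch (not the authors' code) of how a Joint Cloud could be formed for a single joint type: every cross-view pair of 2D detections is triangulated via linear DLT, regardless of target identity, so valid lifts and falsely constructed candidates end up mixed in one candidate set. The function names, inputs, and the pairwise-DLT choice are illustrative assumptions; the learned selection stage (JCSAT/OTAP) that prunes this set is not shown.

```python
# Hypothetical sketch of Joint Cloud construction for ONE joint type,
# assuming calibrated cameras; this is not the paper's implementation.
import itertools
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: 2D pixel coords."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # null vector of A (homogeneous point)
    return X[:3] / X[3]            # homogeneous -> Euclidean

def build_joint_cloud(projections, joints_2d):
    """projections: list of 3x4 camera matrices, one per view.
    joints_2d: joints_2d[v] is an (N_v, 2) array of ALL detections of
    one joint type in view v, with no identity association.
    Returns every cross-view pairwise 3D candidate: valid lifts mixed
    with falsely constructed ones from mismatched identities."""
    cloud = []
    for v1, v2 in itertools.combinations(range(len(projections)), 2):
        for x1 in joints_2d[v1]:
            for x2 in joints_2d[v2]:
                cloud.append(
                    triangulate_dlt(projections[v1], projections[v2], x1, x2))
    return np.asarray(cloud)
```

With V views and N detections per view this yields on the order of V^2 * N^2 candidates per joint type, which is exactly the redundancy the downstream transformer is designed to filter and aggregate.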
Related papers
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.
GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- SEED: A Simple and Effective 3D DETR in Point Clouds [72.74016394325675]
We argue that the main challenges stem from the high sparsity and uneven distribution of point clouds.
We propose a simple and effective 3D DETR method (SEED) for detecting 3D objects from point clouds.
arXiv Detail & Related papers (2024-07-15T14:21:07Z)
- PIDS: Joint Point Interaction-Dimension Search for 3D Point Cloud [36.55716011085907]
PIDS is a novel paradigm that jointly explores point interactions and point dimensions for semantic segmentation on point cloud data.
We establish a large search space to jointly consider versatile point interactions and point dimensions.
We improve the search space exploration by leveraging predictor-based Neural Architecture Search (NAS) and enhance the quality of prediction.
arXiv Detail & Related papers (2022-11-28T20:35:22Z)
- 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information, employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
- A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity.
We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion.
In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
arXiv Detail & Related papers (2022-07-15T10:00:43Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed the Homography Loss, is proposed, which exploits both 2D and 3D information.
Our method outperforms the other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- MultiBodySync: Multi-Body Segmentation and Motion Estimation via 3D Scan Synchronization [61.015704878681795]
We present a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for 3D point clouds.
The two non-trivial challenges posed by this multi-scan, multi-body setting are: (i) guaranteeing correspondence and segmentation consistency across multiple input point clouds, and (ii) obtaining robust motion-based rigid body segmentation applicable to novel object categories.
arXiv Detail & Related papers (2021-01-17T06:36:28Z)
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)