Related papers: Mocap-2-to-3: Lifting 2D Diffusion-Based Pretrained Models for 3D Motion Capture

URL: http://arxiv.org/abs/2503.03222v2
Date: Thu, 06 Mar 2025 14:32:49 GMT
Title: Mocap-2-to-3: Lifting 2D Diffusion-Based Pretrained Models for 3D Motion Capture
Authors: Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou
Abstract summary: Mocap-2-to-3 is a novel framework that decomposes intricate 3D motions into 2D poses. We leverage 2D data to enhance 3D motion reconstruction in diverse scenarios. We evaluate our model's performance on real-world datasets.
Score: 31.82852393452607
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recovering absolute poses in the world coordinate system from monocular views presents significant challenges. Two primary issues arise in this context. Firstly, existing methods rely on 3D motion data for training, which requires collection in limited environments. Acquiring such 3D labels for new actions in a timely manner is impractical, severely restricting the model's generalization capabilities. In contrast, 2D poses are far more accessible and easier to obtain. Secondly, estimating a person's absolute position in metric space from a single viewpoint is inherently more complex. To address these challenges, we introduce Mocap-2-to-3, a novel framework that decomposes intricate 3D motions into 2D poses, leveraging 2D data to enhance 3D motion reconstruction in diverse scenarios and accurately predict absolute positions in the world coordinate system. We initially pretrain a single-view diffusion model with extensive 2D data, followed by fine-tuning a multi-view diffusion model for view consistency using publicly available 3D data. This strategy facilitates the effective use of large-scale 2D data. Additionally, we propose an innovative human motion representation that decouples local actions from global movements and encodes geometric priors of the ground, ensuring the generative model learns accurate motion priors from 2D data. During inference, this allows for the gradual recovery of global movements, resulting in more plausible positioning. We evaluate our model's performance on real-world datasets, demonstrating superior accuracy in motion and absolute human positioning compared to state-of-the-art methods, along with enhanced generalization and scalability. Our code will be made publicly available.

Related papers

UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models [85.76211596755151]
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling. We propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics.
arXiv Detail & Related papers (2025-04-07T17:59:33Z)
xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion [4.878192303432336]
DIOD-3D is the first baseline for multi-object discovery in 3D data using 2D motion. xMOD is a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. Our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets.
arXiv Detail & Related papers (2025-03-19T09:20:35Z)
Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning. UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
Lifting Motion to the 3D World via 2D Diffusion [19.64801640086107]
We introduce MVLift, a novel approach to predict global 3D motion using only 2D pose sequences for training. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses.
arXiv Detail & Related papers (2024-11-27T23:26:56Z)
Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion. We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases. Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment [45.74813582690906]
Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. We present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment. The VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos.
arXiv Detail & Related papers (2024-04-15T06:38:09Z)
Realistic Human Motion Generation with Cross-Diffusion Models [30.854425772128568]
The Cross Human Motion Diffusion Model (CrossDiff) integrates 3D and 2D information using a shared transformer network during training of the diffusion model. CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences.
arXiv Detail & Related papers (2023-12-18T07:44:40Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings. We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z)
Deep Generative Models on 3D Representations: A Survey [81.73385191402419]
Generative models aim to learn the distribution of observed data by generating new instances. Recently, researchers started to shift focus from 2D to 3D space. Representing 3D data poses significantly greater challenges.
arXiv Detail & Related papers (2022-10-27T17:59:50Z)
MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data [69.64723752430244]
We introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task. The first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses. It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses.
arXiv Detail & Related papers (2022-07-20T14:47:28Z)
Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [34.301501457959056]
We propose a temporal regression network with a gated convolution module to transform 2D joints to 3D. A simple yet effective localization approach is also applied to transform the normalized pose to the global trajectory. Our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods.
arXiv Detail & Related papers (2020-10-31T04:35:24Z)
Cascaded deep monocular 3D human pose estimation with evolutionary training data [76.3478675752847]
Deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation. This paper proposes a novel data augmentation method that scales to massive amounts of training data. Our method synthesizes unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge.
arXiv Detail & Related papers (2020-06-14T03:09:52Z)
Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data. We evaluate our proposed approach on two large scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.