Lifting Motion to the 3D World via 2D Diffusion
- URL: http://arxiv.org/abs/2411.18808v1
- Date: Wed, 27 Nov 2024 23:26:56 GMT
- Title: Lifting Motion to the 3D World via 2D Diffusion
- Authors: Jiaman Li, C. Karen Liu, Jiajun Wu
- Abstract summary: We introduce MVLift, a novel approach to predict global 3D motion using only 2D pose sequences for training. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses.
- Score: 19.64801640086107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.
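The abstract does not spell out the lifting step, but the recovery it hinges on, turning view-consistent 2D pose sequences into global 3D joint positions, can be illustrated with standard linear (DLT) triangulation. The sketch below is a minimal illustration under assumed inputs (the camera projection matrices, array shapes, and function names are hypothetical); it is not the authors' implementation, and MVLift additionally recovers joint rotations and a world-frame root trajectory beyond what is shown here.

```python
import numpy as np

def triangulate_joint(projections, points_2d):
    """Linear (DLT) triangulation of one joint from V >= 2 views.

    projections: list of V camera projection matrices, each 3x4.
    points_2d:   (V, 2) array with the joint's pixel coordinates in each view.
    Returns the joint position in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view adds two linear constraints on the homogeneous point X:
        #   u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous least squares: take the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def lift_sequence(projections, poses_2d):
    """Lift multi-view-consistent 2D pose sequences to a global 3D motion.

    poses_2d: (V, T, J, 2) 2D joint positions for V views, T frames, J joints.
    Returns (T, J, 3) world-space joint positions.
    """
    V, T, J, _ = poses_2d.shape
    motion_3d = np.zeros((T, J, 3))
    for t in range(T):
        for j in range(J):
            motion_3d[t, j] = triangulate_joint(projections, poses_2d[:, t, j])
    return motion_3d
```

In MVLift the multiple consistent 2D sequences are produced by the multi-stage 2D diffusion pipeline rather than captured by real calibrated cameras, so the projection matrices here would correspond to the assumed virtual views.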
Related papers
- xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion [4.878192303432336]
DIOD-3D is the first baseline for multi-object discovery in 3D data using 2D motion.
xMOD is a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues.
Our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets.
arXiv Detail & Related papers (2025-03-19T09:20:35Z)
- Mocap-2-to-3: Lifting 2D Diffusion-Based Pretrained Models for 3D Motion Capture [31.82852393452607]
Mocap-2-to-3 is a novel framework that decomposes intricate 3D motions into 2D poses.
We leverage 2D data to enhance 3D motion reconstruction in diverse scenarios.
We evaluate our model's performance on real-world datasets.
arXiv Detail & Related papers (2025-03-05T06:32:49Z)
- Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation [43.915871360698546]
2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities.
We introduce a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data.
Our method efficiently utilizes 2D data, supporting realistic 3D human motion generation and broadening the range of motion types it supports.
arXiv Detail & Related papers (2024-12-17T17:34:52Z)
- 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation [83.98251722144195]
Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions.
We introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space.
We show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions.
arXiv Detail & Related papers (2024-12-10T18:55:13Z)
- Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases (a minimal blending sketch is given after the related-papers list below).
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z)
- Investigating the impact of 2D gesture representation on co-speech gesture generation [5.408549711581793]
We evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model.
arXiv Detail & Related papers (2024-06-21T12:59:20Z)
- SpatialTracker: Tracking Any 2D Pixels in 3D Space [71.58016288648447]
We propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection.
Our method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth estimators (a minimal depth-unprojection sketch is given after the related-papers list below).
Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts.
arXiv Detail & Related papers (2024-04-05T17:59:25Z)
- Realistic Human Motion Generation with Cross-Diffusion Models [30.854425772128568]
We propose the Cross Human Motion Diffusion Model (CrossDiff).
The method integrates 3D and 2D information using a shared transformer network during training of the diffusion model.
CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences.
arXiv Detail & Related papers (2023-12-18T07:44:40Z)
- MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion [57.90404618420159]
We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation.
MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion.
We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers.
arXiv Detail & Related papers (2023-10-23T09:05:18Z)
- Deep Generative Models on 3D Representations: A Survey [81.73385191402419]
Generative models aim to learn the distribution of observed data by generating new instances.
Recently, researchers have started to shift focus from 2D to 3D space.
However, representing 3D data poses significantly greater challenges.
arXiv Detail & Related papers (2022-10-27T17:59:50Z)
- MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
- MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios.
It can retarget body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)
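A minimal sketch of the SE(3) motion-basis idea referenced in the Shape of Motion entry above: each scene point moves as a weighted blend of a small number of rigid transforms. The blending scheme, shapes, and names below are illustrative assumptions (a simple linear blend of transformed points), not the paper's exact formulation.

```python
import numpy as np

def apply_se3_bases(points, basis_transforms, weights):
    """Move canonical 3D points with a compact set of SE(3) motion bases.

    points:           (N, 3) canonical 3D points.
    basis_transforms: (B, 4, 4) rigid transforms for the current frame, one per basis.
    weights:          (N, B) per-point blend weights over the bases (rows sum to 1).
    Returns (N, 3) deformed points: each point follows a weighted blend of a few
    rigid motions, i.e. the low-dimensional motion structure the entry refers to.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=-1)  # (N, 4)
    per_basis = np.einsum('bij,nj->nbi', basis_transforms, homo)         # (N, B, 4)
    blended = np.einsum('nb,nbi->ni', weights, per_basis)                # (N, 4)
    return blended[:, :3]
```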
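And a minimal sketch of the depth-based lifting referenced in the SpatialTracker entry above: 2D pixels are unprojected into camera-space 3D points using a monocular depth map and pinhole intrinsics, after which tracking and ARAP-style regularization can operate in 3D. The intrinsics layout and names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def unproject_pixels(pixels, depth_map, K):
    """Lift 2D pixel coordinates into 3D camera space using a depth map.

    pixels:    (N, 2) array of (u, v) pixel coordinates.
    depth_map: (H, W) per-pixel metric depth (e.g., from a monocular depth estimator).
    K:         3x3 pinhole camera intrinsics.
    Returns (N, 3) camera-space points in which tracking can be performed.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth_map[v.astype(int), u.astype(int)]  # look up depth at each pixel
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```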