Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space
- URL: http://arxiv.org/abs/2406.11253v1
- Date: Mon, 17 Jun 2024 06:31:19 GMT
- Title: Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space
- Authors: Yuan Wang, Zhao Wang, Junhao Gong, Di Huang, Tong He, Wanli Ouyang, Jile Jiao, Xuetao Feng, Qi Dou, Shixiang Tang, Dan Xu,
- Abstract summary: We present $textbfHolistic-Motion2D$, the first comprehensive and large-scale benchmark for 2D whole-body motion generation.
We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion.
- Score: 78.95579123031733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel path to $\textit{general}$ human motion generation by focusing on 2D space. Traditional methods have primarily generated human motions in 3D, which, while detailed and realistic, are often limited by the scope of available 3D motion data in terms of both the size and the diversity. To address these limitations, we exploit extensive availability of 2D motion data. We present $\textbf{Holistic-Motion2D}$, the first comprehensive and large-scale benchmark for 2D whole-body motion generation, which includes over 1M in-the-wild motion sequences, each paired with high-quality whole-body/partial pose annotations and textual descriptions. Notably, Holistic-Motion2D is ten times larger than the previously largest 3D motion dataset. We also introduce a baseline method, featuring innovative $\textit{whole-body part-aware attention}$ and $\textit{confidence-aware modeling}$ techniques, tailored for 2D $\underline{\text T}$ext-driv$\underline{\text{EN}}$ whole-bo$\underline{\text D}$y motion gen$\underline{\text{ER}}$ation, namely $\textbf{Tender}$. Extensive experiments demonstrate the effectiveness of $\textbf{Holistic-Motion2D}$ and $\textbf{Tender}$ in generating expressive, diverse, and realistic human motions. We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion. The page link is: https://holistic-motion2d.github.io.
Related papers
- xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion [4.878192303432336]
DIOD-3D is the first baseline for multi-object discovery in 3D data using 2D motion.
xMOD is a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues.
Our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets.
arXiv Detail & Related papers (2025-03-19T09:20:35Z) - Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.
UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z) - Mocap-2-to-3: Lifting 2D Diffusion-Based Pretrained Models for 3D Motion Capture [31.82852393452607]
Mocap-2-to-3 is a novel framework that decomposes intricate 3D motions into 2D poses.
We leverage 2D data to enhance 3D motion reconstruction in diverse scenarios.
We evaluate our model's performance on real-world datasets.
arXiv Detail & Related papers (2025-03-05T06:32:49Z) - Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation [43.915871360698546]
2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities.
We introduce a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data.
Our method efficiently utilizes 2D data, supporting realistic 3D human motion generation and broadening the range of motion types it supports.
arXiv Detail & Related papers (2024-12-17T17:34:52Z) - Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text [61.9973218744157]
We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories.
Experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.
arXiv Detail & Related papers (2024-06-25T14:42:51Z) - SpatialTracker: Tracking Any 2D Pixels in 3D Space [71.58016288648447]
We propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection.
Our method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth estimators.
Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts.
arXiv Detail & Related papers (2024-04-05T17:59:25Z) - Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D [95.14469865815768]
2D vision models can be used for semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets.
However, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task.
In this paper, we propose Lift3D, which trains to predict unseen views on feature spaces generated by a few visual models.
We even outperform state-of-the-art methods specialized for the task in question.
arXiv Detail & Related papers (2024-03-27T18:13:16Z) - Realistic Human Motion Generation with Cross-Diffusion Models [30.854425772128568]
Cross Human Motion Diffusion Model (CrossDiff)
Method integrates 3D and 2D information using a shared transformer network within the training of the diffusion model.
CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences.
arXiv Detail & Related papers (2023-12-18T07:44:40Z) - RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering
Supervision [36.15913507034939]
We present RenderOcc, a novel paradigm for training 3D occupancy models only using 2D labels.
Specifically, we extract a NeRF-style 3D volume representation from multi-view images.
We employ volume rendering techniques to establish 2D renderings, thus enabling direct 3D supervision from 2D semantics and depth labels.
arXiv Detail & Related papers (2023-09-18T06:08:15Z) - SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose a SurroundOcc method to predict the 3D occupancy with multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z) - Action2video: Generating Videos of Human 3D Actions [31.665831044217363]
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories.
Key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearances.
Action2motionally generates plausible 3D pose sequences of a prescribed action category, which are processed and rendered by motion2video to form 2D videos.
arXiv Detail & Related papers (2021-11-12T20:20:37Z) - M3D-VTON: A Monocular-to-3D Virtual Try-On Network [62.77413639627565]
Existing 3D virtual try-on methods mainly rely on annotated 3D human shapes and garment templates.
We propose a novel Monocular-to-3D Virtual Try-On Network (M3D-VTON) that builds on the merits of both 2D and 3D approaches.
arXiv Detail & Related papers (2021-08-11T10:05:17Z) - Deep Monocular 3D Human Pose Estimation via Cascaded Dimension-Lifting [10.336146336350811]
3D pose estimation from a single image is a challenging problem due to depth ambiguity.
One type of the previous methods lifts 2D joints, obtained by resorting to external 2D pose detectors, to the 3D space.
We propose a novel end-to-end framework that exploits the contextual information but also produces the output directly in the 3D space.
arXiv Detail & Related papers (2021-04-08T05:44:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.