3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
- URL: http://arxiv.org/abs/2602.03796v1
- Date: Tue, 03 Feb 2026 17:59:09 GMT
- Title: 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
- Authors: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai
- Abstract summary: 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. 3DiMo trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control.
- Score: 29.389246008057473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
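The abstract describes two mechanisms concretely enough to illustrate: motion tokens injected into the generator via cross-attention, and an auxiliary SMPL-based geometric loss whose weight is annealed to zero. Below is a minimal PyTorch sketch of both, based only on the abstract; the module names, tensor shapes, token count, and linear annealing schedule are all assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class MotionTokenEncoder(nn.Module):
    """Distills driving frames into a compact set of view-agnostic
    motion tokens (hypothetical architecture)."""
    def __init__(self, in_ch=3, token_dim=768, num_tokens=16):
        super().__init__()
        # Stand-in for a real video feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool3d((num_tokens, 1, 1)),
        )
        self.proj = nn.Linear(64, token_dim)

    def forward(self, driving_frames):            # (B, 3, T, H, W)
        feats = self.backbone(driving_frames)     # (B, 64, num_tokens, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, num_tokens, 64)
        return self.proj(feats)                   # (B, num_tokens, token_dim)

class MotionCrossAttention(nn.Module):
    """Injects motion tokens into the generator's latent sequence
    via residual cross-attention."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents, motion_tokens):    # latents: (B, N, dim)
        out, _ = self.attn(self.norm(latents), motion_tokens, motion_tokens)
        return latents + out

def smpl_loss_weight(step, anneal_steps=10_000, w0=1.0):
    """Auxiliary geometric-supervision weight: SMPL guidance is used
    for early initialization, then linearly annealed to zero."""
    return w0 * max(0.0, 1.0 - step / anneal_steps)
```
Under these assumptions, the training objective would take a form like `diffusion_loss + smpl_loss_weight(step) * geometric_loss`, so the generator gradually shifts from external SMPL guidance to relying on the implicit motion tokens alone.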
Related papers
- Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions [23.080971732537886]
Mimic2DM is a novel motion imitation framework that learns the control policy directly from 2D keypoint trajectories extracted from videos. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains.
arXiv Detail & Related papers (2025-12-09T11:30:56Z)
- DIMO: Diverse 3D Motion Generation for Arbitrary Objects [57.14954351767432]
DIMO is a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. We leverage the rich priors in well-trained video models to extract common motion patterns. At inference time, diverse 3D motions can be sampled from the learned latent space in a single forward pass.
arXiv Detail & Related papers (2025-11-10T18:56:49Z)
- Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation [73.73984727616198]
We present Uni3C, a unified framework for precise control of both camera and human motion in video generation. First, we propose PCDController, a plug-and-play control module trained with a frozen video generative backbone. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters.
arXiv Detail & Related papers (2025-04-21T07:10:41Z)
- Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation [43.915871360698546]
2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. We introduce a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. Our method efficiently utilizes 2D data, enabling realistic 3D human motion generation and broadening the range of supported motion types.
arXiv Detail & Related papers (2024-12-17T17:34:52Z)
- 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation [83.98251722144195]
Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions. We introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space. We show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions.
arXiv Detail & Related papers (2024-12-10T18:55:13Z)
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes [72.02827211293736]
MagicDrive3D is a novel framework for controllable 3D street scene generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. It generates diverse, high-quality 3D driving scenes, supports any-view rendering, and enhances downstream tasks like BEV segmentation.
arXiv Detail & Related papers (2024-05-23T12:04:51Z)
- Decoupling Dynamic Monocular Videos for Dynamic View Synthesis [50.93409250217699]
We tackle the challenge of dynamic view synthesis from dynamic monocular videos in an unsupervised fashion.
Specifically, we decouple the motion of dynamic objects into object motion and camera motion, regularized respectively by the proposed unsupervised surface-consistency and patch-based multi-view constraints.
arXiv Detail & Related papers (2023-04-04T11:25:44Z)
- MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios.
It is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)