RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
- URL: http://arxiv.org/abs/2508.08588v1
- Date: Tue, 12 Aug 2025 03:02:23 GMT
- Title: RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
- Authors: Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang
- Abstract summary: We propose a framework that explicitly decouples motion from appearance, subject from background, and action from trajectory. Our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
- Score: 28.70181587812075
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
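As a concrete illustration of the trajectory-control pipeline described in the abstract (unprojection of edited 2D trajectories with focal-length calibration, coordinate transformation into the world frame, speed alignment, and orientation adjustment), here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the per-point depth input, and the camera parameters f, cx, cy, R, t are assumptions made for the sake of the example.

```python
import numpy as np

def unproject_trajectory(uv, depth, f, cx, cy, R, t):
    """Lift an edited 2D pixel trajectory into 3D world coordinates.

    uv:     (N, 2) pixel coordinates of the edited trajectory
    depth:  (N,)   assumed per-point depth along the camera z-axis
    f:      calibrated focal length in pixels
    cx, cy: principal point
    R, t:   camera-to-world rotation (3, 3) and translation (3,)
    """
    x = (uv[:, 0] - cx) / f * depth
    y = (uv[:, 1] - cy) / f * depth
    cam = np.stack([x, y, depth], axis=1)          # camera-space points
    return cam @ R.T + t                           # world-space points

def align_speed(points, step):
    """Resample a 3D path to uniform spacing `step`, so the trajectory's
    speed matches the root translation of the chosen action."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])    # cumulative arc length
    query = np.linspace(0.0, s[-1], max(int(s[-1] / step), 1) + 1)
    return np.stack([np.interp(query, s, points[:, k]) for k in range(3)], axis=1)

def facing_yaw(points):
    """Per-point yaw from the ground-plane tangent (x, z), assuming a y-up
    world frame; used to reorient the action so the subject faces along
    the trajectory."""
    d = np.gradient(points[:, [0, 2]], axis=0)
    return np.arctan2(d[:, 0], d[:, 1])
```

The conditioning scheme of the second stage (subject injected as extra tokens under full attention, background concatenated along the channel dimension, motion signals added) can likewise be sketched. All shapes, the token width, and the stand-in patchify projection below are illustrative assumptions, not the paper's architecture.

```python
import torch

B, C, T, H, W = 1, 16, 8, 32, 32     # assumed latent shape
D = 768                              # assumed token width

noisy = torch.randn(B, C, T, H, W)
background = torch.randn(B, C, T, H, W)

# Background control: channel-wise concatenation, so the DiT input
# projection sees 2*C channels.
x = torch.cat([noisy, background], dim=1)              # (B, 2C, T, H, W)

proj = torch.nn.Linear(2 * C, D)                       # stand-in patchify layer
video_tokens = proj(x.flatten(2).transpose(1, 2))      # (B, T*H*W, D)

# Subject control: reference-subject tokens appended, so full self-attention
# couples them with the video tokens.
subject_tokens = torch.randn(B, 64, D)
tokens = torch.cat([video_tokens, subject_tokens], dim=1)

# Motion control (trajectory + action): an encoded motion feature with the
# same shape as the video tokens, injected by simple addition.
motion = torch.randn(B, video_tokens.shape[1], D)
tokens[:, : video_tokens.shape[1]] = tokens[:, : video_tokens.shape[1]] + motion
```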
Related papers
- Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions [23.080971732537886]
Mimic2DM is a novel motion imitation framework that learns the control policy directly from 2D keypoint trajectories extracted from videos. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains.
arXiv Detail & Related papers (2025-12-09T11:30:56Z) - Generative Video Motion Editing with 3D Point Tracks [66.55707897151909]
We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a model on a source video and paired 3D point tracks representing source and target motions. Our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation.
arXiv Detail & Related papers (2025-12-01T18:59:55Z) - ATI: Any Trajectory Instruction for Controllable Video Generation [25.249489701215467]
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion. Our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models.
arXiv Detail & Related papers (2025-05-28T23:49:18Z) - CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation [76.72787726497343]
We present CineMaster, a framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors.
arXiv Detail & Related papers (2025-02-12T18:55:36Z) - VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation [62.64811405314847]
VidCRAFT3 is a novel framework for precise image-to-video generation. It enables simultaneous control over camera motion, object motion, and lighting direction. It produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence.
arXiv Detail & Related papers (2025-02-11T13:11:59Z) - MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation [65.74312406211213]
This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis.
arXiv Detail & Related papers (2025-02-06T18:41:04Z) - Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation [21.87745390965703]
We introduce a 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct the 3D-aware motion representation from a reference image, manipulate it based on interpreted user instructions, and perceive it from different viewpoints. Our framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way.
arXiv Detail & Related papers (2025-01-09T07:23:48Z) - LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis [80.2461057573121]
In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users can assign a relative depth to each point on the trajectory. We propose a pioneering method for 3D trajectory control in image-to-video synthesis that abstracts object masks into a few cluster points. Experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements when producing photo-realistic videos from static images.
arXiv Detail & Related papers (2024-12-19T18:59:56Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense motion trajectories. We translate high-level user requests into detailed, semi-dense motion prompts. We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements.
Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z)