Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis
- URL: http://arxiv.org/abs/2412.02261v1
- Date: Tue, 03 Dec 2024 08:34:41 GMT
- Title: Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis
- Authors: Jingyu Gong, Chong Zhang, Fengqi Liu, Ke Fan, Qianyu Zhou, Xin Tan, Zhizhong Zhang, Yuan Xie, Lizhuang Ma
- Abstract summary: We propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis. In this framework, we disentangle human-scene interaction from motion synthesis during training. We show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods.
- Score: 48.65197562914734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, whose quantity is limited, and generalize poorly to diverse scenes when trained on only a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so that motion naturalness and interaction plausibility are maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN-inversion manner to maintain motion continuity and to control keyframe poses through the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture and on real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods, indicating the feasibility of applying DIP to motion synthesis in more general tasks and versatile scenes. https://jingyugong.github.io/DiffusionImplicitPolicy/
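As a rough illustration of the two mechanisms named in the abstract (iterative denoising interleaved with implicit-policy optimization, and motion blending that fuses rotations in power space and translations linearly), here is a minimal NumPy sketch. This is only our reading of the abstract, not the authors' released code: `denoise_step`, `policy_grad`, `guide_lr`, and the quaternion pose format are all assumed placeholders.

```python
import numpy as np

def slerp(q0, q1, t):
    """Blend unit quaternions in 'power space':
    slerp(q0, q1, t) = q0 * (q0^{-1} q1)^t."""
    dot = float(np.clip(np.dot(q0, q1), -1.0, 1.0))
    if dot < 0.0:            # flip to take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: linear fallback
        q = (1.0 - t) * q0 + t * q1
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def blend_pose(pose_a, pose_b, t):
    """Fuse two poses for sub-task transitions: rotations in power
    (slerp) space, root translations in linear space."""
    return {
        "rot": slerp(pose_a["rot"], pose_b["rot"], t),
        "trans": (1.0 - t) * pose_a["trans"] + t * pose_b["trans"],
    }

def guided_denoise(denoise_step, policy_grad, x_T, timesteps, guide_lr=0.05):
    """Iterative diffusion denoising interleaved with an implicit-policy
    update on the intermediate noised motion (GAN-inversion style).
    `denoise_step(x, t)` runs one reverse-diffusion step of a pretrained,
    scene-agnostic motion model; `policy_grad(x, t)` returns the gradient
    of an interaction-plausibility objective. Both are hypothetical hooks.
    """
    x = x_T
    for t in sorted(timesteps, reverse=True):
        x = denoise_step(x, t)                 # keeps motion natural
        x = x + guide_lr * policy_grad(x, t)   # keeps interaction plausible
    return x
```

Under this reading, keyframe control via the ControlNet branch and motion inpainting would live inside `denoise_step`, while `blend_pose` would be applied over overlapping frames when stitching long-term sub-task motions.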
Related papers
- Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy [0.0]
We propose a human motion synthesis framework, termed SSOMotion, that takes a unified Scene Semantic Occupancy (SSO) as its scene representation. Scene hints and movement directions derived from instructions are used for motion control via frame-wise scene queries. Experiments and ablation studies conducted on cluttered scenes using ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance.
arXiv Detail & Related papers (2025-11-11T04:33:16Z) - Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis [80.50342609047091]
Scene-aware text-to-human-motion synthesis generates diverse indoor motion samples from the same textual description. We propose a cascaded generation strategy that factorizes text-driven, scene-specific human motion generation into three stages. We jointly improve realistic human motion synthesis and robust human motion analysis in 3D scenes.
arXiv Detail & Related papers (2025-03-01T06:56:58Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - Motion Dreamer: Boundary Conditional Motion Reasoning for Physically Coherent Video Generation [27.690736225683825]
We introduce Motion Dreamer, a two-stage framework that explicitly separates motion reasoning from visual synthesis.
Our approach introduces instance flow, a sparse-to-dense motion representation enabling effective integration of partial user-defined motions.
Experiments demonstrate that Motion Dreamer significantly outperforms existing methods, achieving superior motion plausibility and visual realism.
arXiv Detail & Related papers (2024-11-30T17:40:49Z) - KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
arXiv Detail & Related papers (2024-11-23T06:50:11Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - DEMOS: Dynamic Environment Motion Synthesis in 3D Scenes via Local Spherical-BEV Perception [54.02566476357383]
We propose the first Dynamic Environment MOtion Synthesis framework (DEMOS) to predict future motion instantly according to the current scene.
We then use it to dynamically update the latent motion for final motion synthesis.
The results show our method significantly outperforms previous works and handles dynamic environments well.
arXiv Detail & Related papers (2024-03-04T05:38:16Z) - Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
arXiv Detail & Related papers (2023-08-03T16:18:32Z) - LEO: Generative Latent Image Animator for Human Video Synthesis [38.99490968487773]
We propose a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency.
Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.
We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM).
arXiv Detail & Related papers (2023-05-06T09:29:12Z) - Audio2Gestures: Generating Diverse Gestures from Audio [28.026220492342382]
We propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code.
Our method generates more realistic and diverse motions than previous state-of-the-art methods.
arXiv Detail & Related papers (2023-01-17T04:09:58Z) - Locomotion-Action-Manipulation: Synthesizing Human-Scene Interactions in Complex 3D Environments [11.87902527509297]
We present LAMA, Locomotion-Action-MAnipulation, to synthesize natural and plausible long-term human movements in complex indoor environments.
Unlike existing methods that require motion data "paired" with scanned 3D scenes for supervision, we formulate the problem as a test-time optimization by using human motion capture data only for synthesis.
arXiv Detail & Related papers (2023-01-09T18:59:16Z) - IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions [69.95820880360345]
We present the first framework to synthesize the full-body motion of virtual human characters with 3D objects placed within their reach.
Our system takes as input textual instructions specifying the objects and the associated intentions of the virtual characters.
We show that our synthesized full-body motions appear more realistic to the participants in more than 80% of scenarios.
arXiv Detail & Related papers (2022-12-14T23:59:24Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z) - MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model [35.32967411186489]
MotionDiffuse is a diffusion model-based text-driven motion generation framework.
It excels at modeling complicated data distribution and generating vivid motion sequences.
It responds to fine-grained instructions on body parts and supports arbitrary-length motion synthesis with time-varied text prompts.
arXiv Detail & Related papers (2022-08-31T17:58:54Z) - Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis [117.15586710830489]
We focus on the problem of synthesizing diverse scene-aware human motions under the guidance of target action sequences.
Based on a factorized scheme, a hierarchical framework is proposed, with each sub-module responsible for modeling one aspect.
Experiment results show that the proposed framework remarkably outperforms previous methods in terms of diversity and naturalness.
arXiv Detail & Related papers (2022-05-25T18:20:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.