Semantic Latent Motion for Portrait Video Generation
- URL: http://arxiv.org/abs/2503.10096v1
- Date: Thu, 13 Mar 2025 06:43:21 GMT
- Title: Semantic Latent Motion for Portrait Video Generation
- Authors: Qiyuan Zhang, Chenyu Wu, Wenzhang Sun, Huaize Liu, Donglin Di, Wei Chen, Changqing Zou
- Abstract summary: Semantic Latent Motion (SeMo) is a compact and expressive motion representation. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. Our approach surpasses state-of-the-art models with an 81% win rate in realism.
- Score: 19.56640370303683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generation models, which may introduce unrealistic motion and lead to inefficient inference. To address these challenges, we propose Semantic Latent Motion (SeMo), a compact and expressive motion representation. Leveraging this representation, our approach achieves both high-quality visual results and efficient inference. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. First, in the Abstraction step, we use a carefully designed Mask Motion Encoder to compress the subject's motion state into a compact and abstract latent motion (1D token). Second, in the Reasoning step, long-term modeling and efficient reasoning are performed in this latent space to generate motion sequences. Finally, in the Generation step, the motion dynamics serve as conditional information to guide the generation model in synthesizing realistic transitions from reference frames to target frames. Thanks to the compact and descriptive nature of Semantic Latent Motion, our method enables real-time video generation with highly realistic motion. User studies demonstrate that our approach surpasses state-of-the-art models with an 81% win rate in realism. Extensive experiments further highlight its strong compression capability, reconstruction quality, and generative potential. Moreover, its fully self-supervised nature suggests promising applications in broader video generation tasks.
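The three-step framework above maps naturally onto three modules: an encoder that abstracts each frame's motion state into a single 1D token, a temporal model that reasons over those tokens, and a generator that synthesizes a target frame from a reference frame under their guidance. Below is a minimal, hypothetical PyTorch sketch of that flow; all module internals, dimensions, and names (other than MaskMotionEncoder, which the abstract mentions) are illustrative assumptions, not the paper's released code.

```python
# A minimal, hypothetical sketch of the Abstraction -> Reasoning -> Generation
# flow described in the abstract. Module internals and dimensions are
# illustrative assumptions only.
import torch
import torch.nn as nn

class MaskMotionEncoder(nn.Module):
    """Abstraction: compress each frame's motion state into one 1D token."""
    def __init__(self, frame_dim, token_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim, 512), nn.GELU(), nn.Linear(512, token_dim))

    def forward(self, frames):               # frames: (B, T, frame_dim)
        return self.net(frames)              # -> (B, T, token_dim)

class MotionReasoner(nn.Module):
    """Reasoning: long-term temporal modeling in the compact latent space."""
    def __init__(self, token_dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):               # (B, T, token_dim)
        return self.temporal(tokens)

class ConditionalGenerator(nn.Module):
    """Generation: synthesize a target frame from a reference frame,
    conditioned on a latent motion token."""
    def __init__(self, frame_dim, token_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + token_dim, 512), nn.GELU(),
            nn.Linear(512, frame_dim))

    def forward(self, reference, token):     # (B, frame_dim), (B, token_dim)
        return self.net(torch.cat([reference, token], dim=-1))

# Toy end-to-end pass over random data.
B, T, D = 2, 16, 3 * 64 * 64                 # 16 frames of flattened 64x64 RGB
frames = torch.randn(B, T, D)
encoder, reasoner, generator = MaskMotionEncoder(D), MotionReasoner(), ConditionalGenerator(D)
tokens = reasoner(encoder(frames))           # per-frame latent motion, (B, T, 128)
target = generator(frames[:, 0], tokens[:, -1])
print(target.shape)                          # torch.Size([2, 12288])
```

Because each frame's motion collapses to a single small token, the temporal model operates on a sequence of length T rather than on pixels, which is plausibly where the claimed real-time inference budget comes from.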
Related papers
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos.
We advocate for the incorporation of a retrieval mechanism during the generation phase.
Our pipeline is designed to apply to any text-to-video diffusion model.
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators.
VideoJAM achieves state-of-the-art performance in motion coherence.
These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
- Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing (a toy trajectory encoding is sketched after this entry).
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
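How a sparse trajectory might become a model-readable conditioning signal is easy to illustrate. The rasterization below is a hypothetical toy encoding (the function name, tensor layout, and splatting scheme are assumptions, not the paper's implementation): tracked points are written into a per-frame displacement map that is non-zero only where points were tracked.

```python
# Hypothetical sketch: turning sparse point tracks into a dense "motion
# prompt" tensor a conditional video model could consume. Not the paper's code.
import torch

def trajectories_to_prompt(tracks, T, H, W):
    """tracks: (N, T, 2) pixel coordinates of N tracked points over T frames.
    Returns a (T, 2, H, W) tensor holding each point's displacement to the
    next frame at its current location, zero everywhere else."""
    prompt = torch.zeros(T, 2, H, W)
    for t in range(T - 1):
        for n in range(tracks.shape[0]):
            x, y = tracks[n, t].long().tolist()
            if 0 <= x < W and 0 <= y < H:
                # Displacement of this point from frame t to frame t+1.
                prompt[t, :, y, x] = tracks[n, t + 1] - tracks[n, t]
    return prompt

# Toy example: 3 points drifting right by 2 px per frame in a 16-frame clip.
T, H, W = 16, 64, 64
base = torch.tensor([[10.0, 20.0], [30.0, 30.0], [50.0, 40.0]])
tracks = base[:, None, :] + torch.arange(T)[None, :, None] * torch.tensor([2.0, 0.0])
prompt = trajectories_to_prompt(tracks, T, H, W)
print(prompt.shape, prompt[0, 0, 20, 10].item())  # torch.Size([16, 2, 64, 64]) 2.0
```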
- MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models [59.10171699717122]
MoTrans is a customized motion transfer method enabling video generation of similar motion in new contexts.
Multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance.
Our method effectively learns specific motion patterns from single or multiple reference videos.
arXiv Detail & Related papers (2024-12-02T10:07:59Z)
- Motion Dreamer: Boundary Conditional Motion Reasoning for Physically Coherent Video Generation [27.690736225683825]
We introduce Motion Dreamer, a two-stage framework that explicitly separates motion reasoning from visual synthesis.
Our approach introduces instance flow, a sparse-to-dense motion representation enabling effective integration of partial user-defined motions.
Experiments demonstrate that Motion Dreamer significantly outperforms existing methods, achieving superior motion plausibility and visual realism.
arXiv Detail & Related papers (2024-11-30T17:40:49Z)
- LaMD: Latent Motion Diffusion for Video Generation [69.4111397077229]
The latent motion diffusion (LaMD) framework consists of a motion-decomposed video autoencoder and a diffusion-based motion generator (a toy version is sketched after this entry).
Results show that LaMD generates high-quality videos with a wide range of motions, from dynamics to highly controllable movements.
arXiv Detail & Related papers (2023-04-23T10:32:32Z)
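A minimal sketch of the two-component design named above, under assumed shapes and module choices (the GRU motion encoder, the dimensions, and the toy noising step are illustrative, not LaMD's actual architecture):

```python
# Hypothetical sketch of a motion-decomposed autoencoder plus a diffusion
# model over motion latents. Shapes and modules are assumptions only.
import torch
import torch.nn as nn

class MotionDecomposedAE(nn.Module):
    def __init__(self, frame_dim=1024, content_dim=256, motion_dim=64):
        super().__init__()
        self.content_enc = nn.Linear(frame_dim, content_dim)  # from first frame
        self.motion_enc = nn.GRU(frame_dim, motion_dim, batch_first=True)
        self.dec = nn.Linear(content_dim + motion_dim, frame_dim)

    def encode(self, video):                     # video: (B, T, frame_dim)
        content = self.content_enc(video[:, 0])  # static content, (B, content_dim)
        motion, _ = self.motion_enc(video)       # per-frame motion, (B, T, motion_dim)
        return content, motion

    def decode(self, content, motion):           # broadcast content over time
        T = motion.shape[1]
        c = content[:, None, :].expand(-1, T, -1)
        return self.dec(torch.cat([c, motion], dim=-1))

# A denoiser over motion latents would be trained with a standard diffusion
# objective; here we only show the forward noising step it learns to invert.
B, T, D = 2, 8, 1024
ae = MotionDecomposedAE()
content, motion = ae.encode(torch.randn(B, T, D))
noise = torch.randn_like(motion)
alpha_bar = 0.5                                  # toy noise level
noisy_motion = alpha_bar**0.5 * motion + (1 - alpha_bar)**0.5 * noise
recon = ae.decode(content, motion)
print(noisy_motion.shape, recon.shape)           # (2, 8, 64) (2, 8, 1024)
```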
- High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z)
- Dynamic Future Net: Diversified Human Motion Generation [31.987602940970888]
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality.
We present Dynamic Future Net, a new deep learning model that explicitly focuses on the intrinsic stochasticity of human motion dynamics.
Our model can generate a large number of high-quality motions with arbitrary duration and visually convincing variations in both space and time.
arXiv Detail & Related papers (2020-08-25T02:31:41Z)
- Hierarchical Style-based Networks for Motion Synthesis [150.226137503563]
We propose a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location.
Our proposed method learns to model human motion by decomposing the long-range generation task in a hierarchical manner.
On a large-scale skeleton dataset, we show that the proposed method is able to synthesise long-range, diverse and plausible motion.
arXiv Detail & Related papers (2020-08-24T02:11:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.