LEO: Generative Latent Image Animator for Human Video Synthesis
- URL: http://arxiv.org/abs/2305.03989v2
- Date: Wed, 11 Oct 2023 10:26:27 GMT
- Title: LEO: Generative Latent Image Animator for Human Video Synthesis
- Authors: Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, Yu Qiao
- Abstract summary: We propose a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency.
Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.
We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM).
- Score: 42.925592662547814
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatio-temporal coherency is a major challenge in synthesizing high quality
videos, particularly in synthesizing human videos that contain rich global and
local deformations. To resolve this challenge, previous approaches have
resorted to different features in the generation process aimed at representing
appearance and motion. However, in the absence of strict mechanisms to
guarantee such disentanglement, a separation of motion from appearance has
remained challenging, resulting in spatial distortions and temporal jittering
that break the spatio-temporal coherency. Motivated by this, we here propose
LEO, a novel framework for human video synthesis, placing emphasis on
spatio-temporal coherency. Our key idea is to represent motion as a sequence of
flow maps in the generation process, which inherently isolate motion from
appearance. We implement this idea via a flow-based image animator and a Latent
Motion Diffusion Model (LMDM). The former bridges a space of motion codes with
the space of flow maps, and synthesizes video frames in a warp-and-inpaint
manner. LMDM learns to capture the motion prior in the training data by
synthesizing sequences of motion codes. Extensive quantitative and qualitative
analysis suggests that LEO significantly improves coherent synthesis of human
videos over previous methods on the datasets TaichiHD, FaceForensics and
CelebV-HQ. In addition, the effective disentanglement of appearance and motion
in LEO allows for two additional tasks, namely infinite-length human video
synthesis, as well as content-preserving video editing.
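To make the described generation process concrete, the sketch below illustrates how a sequence of motion codes could drive frame synthesis: each code is decoded into a dense flow map, and the starting frame is warped accordingly. All module names, tensor shapes, and the random stand-in codes are illustrative assumptions, not the authors' released implementation, which also includes the inpainting step and the trained LMDM sampler that would actually produce the motion codes.

```python
# Minimal sketch of a warp-based animation loop, assuming flow maps decoded from motion codes.
# Shapes, layer choices, and the random codes are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlowDecoder(nn.Module):
    """Maps a low-dimensional motion code to a dense flow map (hypothetical layout)."""
    def __init__(self, code_dim=128, size=64):
        super().__init__()
        self.size = size
        self.net = nn.Linear(code_dim, 2 * size * size)  # 2 channels: (dx, dy)

    def forward(self, code):
        flow = self.net(code).view(-1, 2, self.size, self.size)
        return torch.tanh(flow)  # normalized displacements in [-1, 1]


def warp(frame, flow):
    """Warp a frame with a flow map via grid sampling (the 'warp' half of warp-and-inpaint)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = base_grid + flow.permute(0, 2, 3, 1)
    return F.grid_sample(frame, grid, align_corners=True)


# Toy usage: animate one 64x64 frame with a short sequence of motion codes.
# In LEO the codes would be sampled from the LMDM; random codes stand in here.
decoder = FlowDecoder()
start_frame = torch.rand(1, 3, 64, 64)
motion_codes = torch.randn(16, 128)            # 16 steps of 128-d motion codes
frames = [warp(start_frame, decoder(c.unsqueeze(0))) for c in motion_codes]
video = torch.stack(frames, dim=1)             # (1, 16, 3, 64, 64)
print(video.shape)
```

Because appearance enters only through the warped starting frame while motion enters only through the code sequence, this kind of factorization is what allows the infinite-length synthesis and content-preserving editing mentioned above.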
Related papers
- Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation [15.569467643817447]
We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations.
We train on real-world videos enhanced with this innovative motion depiction approach.
To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
arXiv Detail & Related papers (2024-05-26T00:53:26Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z)
- LaMD: Latent Motion Diffusion for Video Generation [69.4111397077229]
The latent motion diffusion (LaMD) framework consists of a motion-decomposed video autoencoder and a diffusion-based motion generator.
Results show that LaMD generates high-quality videos with a wide range of motions, from dynamics to highly controllable movements.
arXiv Detail & Related papers (2023-04-23T10:32:32Z)
- Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis [56.550999933048075]
We propose a video based synthesis method that tackles challenges and demonstrates high quality results for in-the-wild videos.
We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes.
We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2021-11-10T20:18:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.