Conditional Image-to-Video Generation with Latent Flow Diffusion Models
- URL: http://arxiv.org/abs/2303.13744v1
- Date: Fri, 24 Mar 2023 01:54:26 GMT
- Title: Conditional Image-to-Video Generation with Latent Flow Diffusion Models
- Authors: Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, Martin Renqiang Min
- Abstract summary: Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image and a condition.
We propose an approach for cI2V using novel latent flow diffusion models (LFDM).
LFDM synthesizes an optical flow sequence in the latent space based on the given condition to warp the given image.
- Score: 18.13991670747915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conditional image-to-video (cI2V) generation aims to synthesize a new
plausible video starting from an image (e.g., a person's face) and a condition
(e.g., an action class label like smile). The key challenge of the cI2V task
lies in the simultaneous generation of realistic spatial appearance and
temporal dynamics corresponding to the given image and condition. In this
paper, we propose an approach for cI2V using novel latent flow diffusion models
(LFDM) that synthesize an optical flow sequence in the latent space based on
the given condition to warp the given image. Compared to previous
direct-synthesis-based works, our proposed LFDM can better synthesize spatial
details and temporal motion by fully utilizing the spatial content of the given
image and warping it in the latent space according to the generated
temporally-coherent flow. The training of LFDM consists of two separate stages:
(1) an unsupervised learning stage to train a latent flow auto-encoder for
spatial content generation, including a flow predictor to estimate latent flow
between pairs of video frames, and (2) a conditional learning stage to train a
3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike
previous DMs that operate in pixel space or in a latent feature space coupling
spatial and temporal information, the DM in our LFDM only needs to learn a
low-dimensional latent flow space for motion generation, making it more
computationally efficient. We conduct comprehensive experiments on multiple
datasets, where LFDM consistently outperforms prior methods. Furthermore, we show
that LFDM can be easily adapted to new domains by simply finetuning the image
decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM.
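To make the two-stage pipeline described above more concrete, here is a minimal sketch of LFDM-style cI2V inference in PyTorch. All module names, shapes, and the trivial "sampler" below are illustrative stand-ins, not the actual API of the linked repository: the given image is encoded once, a condition-dependent latent flow sequence is produced, the fixed latent is warped frame by frame, and each warped latent is decoded into a video frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_latent(z, flow):
    """Warp a latent map z (B, C, H, W) with a dense flow field (B, 2, H, W).

    The flow (in normalized [-1, 1] coordinates) is added to an identity
    sampling grid, so each output location pulls features from a displaced
    position of the input latent.
    """
    b, _, h, w = z.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=z.device),
        torch.linspace(-1, 1, w, device=z.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(z, grid + flow.permute(0, 2, 3, 1), align_corners=True)

class ToyLatentEncoder(nn.Module):
    """Stand-in for the stage-1 latent flow auto-encoder's image encoder."""
    def __init__(self, latent_ch=64):
        super().__init__()
        self.net = nn.Conv2d(3, latent_ch, kernel_size=4, stride=4)
    def forward(self, x):
        return self.net(x)

class ToyImageDecoder(nn.Module):
    """Stand-in decoder that maps a (warped) latent back to an image."""
    def __init__(self, latent_ch=64):
        super().__init__()
        self.net = nn.ConvTranspose2d(latent_ch, 3, kernel_size=4, stride=4)
    def forward(self, z):
        return self.net(z)

class ToyFlowDM(nn.Module):
    """Stand-in for the stage-2 3D-UNet diffusion model: instead of reverse
    diffusion it just emits a smooth, class-dependent latent flow sequence."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.cond_emb = nn.Embedding(num_classes, 2)
    def sample(self, cond, num_frames, h, w):
        b = cond.shape[0]
        drift = self.cond_emb(cond).view(b, 1, 2, 1, 1)          # class-dependent motion direction
        steps = torch.linspace(0, 0.2, num_frames).view(1, -1, 1, 1, 1)
        return drift * steps + 0.01 * torch.randn(b, num_frames, 2, h, w)

# cI2V inference: encode the image once, generate a latent flow sequence from
# the condition, warp the fixed latent per frame, and decode each warped latent.
encoder, decoder, flow_dm = ToyLatentEncoder(), ToyImageDecoder(), ToyFlowDM()
image = torch.randn(1, 3, 64, 64)    # the given image
cond = torch.tensor([3])             # e.g. an action class label such as "smile"
z0 = encoder(image)                  # (1, 64, 16, 16)
flows = flow_dm.sample(cond, num_frames=8, h=16, w=16)
video = torch.stack([decoder(warp_latent(z0, flows[:, t])) for t in range(8)], dim=1)
print(video.shape)                   # torch.Size([1, 8, 3, 64, 64])
```

The point the abstract emphasizes is visible in the shapes: the diffusion model only has to model the low-dimensional latent flow sequence, while spatial detail is carried by warping the encoded input image, which is also why adapting to a new domain can reduce to finetuning the image decoder.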
Related papers
- S2DM: Sector-Shaped Diffusion Models for Video Generation [2.0270353391739637]
We propose a novel Sector-Shaped Diffusion Model (S2DM) for video generation.
S2DM can generate a group of intrinsically related samples that share the same semantic features.
We show that, without additional training, our model combined with another temporal-condition generative model can still achieve performance comparable to existing works.
arXiv Detail & Related papers (2024-03-20T08:50:15Z)
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, which increases the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- Generative Modeling with Phase Stochastic Bridges [49.4474628881673]
Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs.
We introduce a novel generative modeling framework grounded in phase space dynamics.
Our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.
arXiv Detail & Related papers (2023-10-11T18:38:28Z)
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z)
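The Latent-Shift entry above relies on a temporal shift operation applied to the latent features of a pretrained text-to-image model. The sketch below illustrates the general idea only; the tensor layout and shift ratio are assumptions, not the paper's exact design. Shifting a fraction of channels by one step along the time axis is parameter-free, yet it lets a per-frame 2D network exchange information between neighboring frames.

```python
import torch

def temporal_shift(x, shift_ratio=0.25):
    """Shift a fraction of channels forward/backward along the time axis.

    x: latent features of shape (B, T, C, H, W). One chunk of channels is
    shifted one frame later, another one frame earlier, and the rest are
    left untouched, mixing temporal information at zero parameter cost.
    """
    b, t, c, h, w = x.shape
    fold = int(c * shift_ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shifted one frame later
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shifted one frame earlier
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted channels
    return out

feats = torch.randn(2, 8, 64, 16, 16)        # (B, T, C, H, W) latent feature maps
print(temporal_shift(feats).shape)           # torch.Size([2, 8, 64, 16, 16])
```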
- LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective.
We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
arXiv Detail & Related papers (2022-03-15T13:22:57Z)
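The "latent likelihood objective" mentioned for LiP-Flow rests on the standard change-of-variables formula for normalizing flows; in generic notation (not the paper's):

```latex
% Log-likelihood of x under an invertible map f_\theta to a base density p_Z
\log p_X(x) = \log p_Z\bigl(f_\theta(x)\bigr)
            + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

Because f_\theta is invertible with a tractable Jacobian determinant, this quantity can be evaluated and maximized directly, which is what allows a flow bridging two latent spaces to be trained with a likelihood objective.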
- P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
- C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer [5.220611885921671]
We propose a Coarse-to-Fine Flow Warping Network (C2F-FWN) for spatial-temporally consistent human video motion transfer (HVMT).
C2F-FWN employs Flow Temporal Consistency (FTC) Loss to enhance temporal consistency.
Our approach outperforms state-of-the-art HVMT methods in terms of both spatial and temporal consistency.
arXiv Detail & Related papers (2020-12-16T14:11:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.