Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE
- URL: http://arxiv.org/abs/2112.10960v1
- Date: Tue, 21 Dec 2021 03:30:38 GMT
- Title: Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE
- Authors: Kangyeol Kim, Sunghyun Park, Junsoo Lee, Joonseok Lee, Sookyung Kim,
Jaegul Choo, Edward Choi
- Abstract summary: We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints at arbitrary frame rates, and the second stage synthesizes videos based on the given keypoint sequence and the appearance noise vector.
- Score: 26.13198266911874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To perform unconditional video generation, we must learn the
distribution of real-world videos. In an effort to synthesize high-quality
videos, various studies have attempted to learn a mapping function between
noise and videos, including recent efforts to separate the motion distribution
from the appearance distribution. Previous methods, however, learn motion
dynamics at discretized, fixed-interval timesteps, which is contrary to the
continuous nature of the motion of a physical body. In this paper, we propose a
novel video generation approach that learns separate distributions for motion
and appearance, the former modeled by a neural ODE to capture natural motion
dynamics. Specifically, we employ a two-stage approach in which the first stage
converts a noise vector to a sequence of keypoints at an arbitrary frame rate,
and the second stage synthesizes the video from the given keypoint sequence and
the appearance noise vector. Our model not only quantitatively outperforms
recent baselines for video generation but also offers versatile functionality
such as dynamic frame-rate manipulation and motion transfer between two
datasets, opening new doors to diverse video generation applications.
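The two-stage pipeline described in the abstract can be illustrated with a short sketch: a neural ODE evolves a latent motion state sampled from noise and is decoded into keypoints at arbitrary, continuous timestamps, and a second network renders frames from those keypoints together with an appearance noise vector. The code below is a minimal, hypothetical sketch of that idea; the module names, dimensions, toy MLP decoders, and the use of the torchdiffeq library are assumptions for illustration, not the authors' implementation.

```python
# Minimal PyTorch sketch of the two-stage idea described in the abstract.
# Stage 1 evolves a latent motion state with a neural ODE and decodes it into
# keypoints at arbitrary (continuous) timestamps; Stage 2 renders frames from
# the keypoint sequence plus an appearance noise vector. All module names,
# dimensions, the toy MLP decoders, and the use of torchdiffeq are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq


class MotionODE(nn.Module):
    """Parameterizes dz/dt = f(z), i.e. continuous motion dynamics in latent space."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(latent_dim, 128), nn.Tanh(),
                               nn.Linear(128, latent_dim))

    def forward(self, t, z):  # odeint expects f(t, z)
        return self.f(z)


class KeypointStage(nn.Module):
    """Stage 1: motion noise -> keypoint sequence at arbitrary timestamps."""

    def __init__(self, noise_dim=64, latent_dim=64, num_keypoints=10):
        super().__init__()
        self.init = nn.Linear(noise_dim, latent_dim)            # noise -> z(0)
        self.ode = MotionODE(latent_dim)
        self.decode = nn.Linear(latent_dim, num_keypoints * 2)  # z(t) -> (x, y) keypoints

    def forward(self, motion_noise, timestamps):
        z0 = self.init(motion_noise)            # (B, latent_dim)
        zt = odeint(self.ode, z0, timestamps)   # (T, B, latent_dim)
        kp = self.decode(zt)                    # (T, B, K*2)
        return kp.reshape(*kp.shape[:-1], -1, 2)


class VideoStage(nn.Module):
    """Stage 2: keypoints + appearance noise -> frames (toy MLP renderer)."""

    def __init__(self, num_keypoints=10, app_dim=64, frame_size=32):
        super().__init__()
        self.frame_size = frame_size
        self.render = nn.Sequential(
            nn.Linear(num_keypoints * 2 + app_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * frame_size * frame_size))

    def forward(self, keypoints, appearance_noise):
        T, B = keypoints.shape[:2]
        app = appearance_noise.expand(T, -1, -1)            # one appearance, all frames
        x = torch.cat([keypoints.flatten(-2), app], dim=-1)
        return self.render(x).view(T, B, 3, self.frame_size, self.frame_size)


# Because timestamps are continuous, the same latent motion trajectory can be
# sampled at any frame rate (here 8 fps vs. 24 fps over one second of motion).
stage1, stage2 = KeypointStage(), VideoStage()
motion_noise, appearance_noise = torch.randn(4, 64), torch.randn(1, 4, 64)
for fps in (8, 24):
    t = torch.linspace(0.0, 1.0, fps)
    video = stage2(stage1(motion_noise, t), appearance_noise)
    print(fps, video.shape)  # (fps, 4, 3, 32, 32)
```

Sampling the same latent trajectory at different sets of timestamps is what enables the dynamic frame-rate manipulation mentioned in the abstract; the training objectives (e.g. adversarial losses on the generated keypoints and frames) are omitted from this sketch.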
Related papers
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency [15.841490425454344]
We propose an end-to-end audio-only conditioned video diffusion model named Loopy.
Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information.
arXiv Detail & Related papers (2024-09-04T11:55:14Z)
- Unfolding Videos Dynamics via Taylor Expansion [5.723852805622308]
We present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi).
ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence.
ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings (see the minimal sketch after this list).
arXiv Detail & Related papers (2024-09-04T01:41:09Z)
- Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation [15.569467643817447]
We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations.
We train on real-world videos enhanced with this innovative motion depiction approach.
To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
arXiv Detail & Related papers (2024-05-26T00:53:26Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- Customizing Motion in Text-to-Video Diffusion Models [79.4121510826141]
We introduce an approach for augmenting text-to-video generation models with customized motions.
By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
arXiv Detail & Related papers (2023-12-07T18:59:03Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- Diverse Dance Synthesis via Keyframes with Transformer Controllers [10.23813069057791]
We propose a novel keyframe-based motion generation network built on multiple constraints, which can achieve diverse dance synthesis via learned knowledge.
The backbone of our network is a hierarchical RNN module composed of two long short-term memory (LSTM) units, in which the first LSTM is utilized to embed the posture information of the historical frames into a latent space.
Our framework contains two Transformer-based controllers, which are used to model the constraints of the root trajectory and the velocity factor respectively.
arXiv Detail & Related papers (2022-07-13T00:56:46Z)
- Dynamic View Synthesis from Dynamic Monocular Video [69.80425724448344]
We present an algorithm for generating views from arbitrary viewpoints at any input time step, given a monocular video of a dynamic scene.
We show extensive quantitative and qualitative results of dynamic view synthesis from casually captured videos.
arXiv Detail & Related papers (2021-05-13T17:59:50Z)
- Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-to-Video Synthesis [38.41763708731513]
We propose Dual Motion Transfer GAN (Dual-MTGAN), which takes image and video data as inputs while learning disentangled content and motion representations.
Our Dual-MTGAN is able to perform deterministic motion transfer and motion generation.
The proposed model is trained in an end-to-end manner, without the need to utilize pre-defined motion features like pose or facial landmarks.
arXiv Detail & Related papers (2021-02-26T06:54:48Z)
- Hierarchical Style-based Networks for Motion Synthesis [150.226137503563]
We propose a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location.
Our proposed method learns to model human motion by decomposing the long-range generation task in a hierarchical manner.
On a large-scale skeleton dataset, we show that the proposed method is able to synthesize long-range, diverse and plausible motion.
arXiv Detail & Related papers (2020-08-24T02:11:02Z)
- Non-Adversarial Video Synthesis with Learned Priors [53.26777815740381]
We focus on the problem of generating videos from latent noise vectors, without any reference input frames.
We develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network, and a generator through non-adversarial learning.
Our approach generates superior quality videos compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2020-03-21T02:57:33Z)
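For the ViDiDi entry above (Unfolding Videos Dynamics via Taylor Expansion), the sketch below illustrates the core idea of encoding a clip together with its temporal derivatives using a single shared network. The finite-difference derivatives, the tiny 3D-CNN encoder, and the cosine consistency objective are assumptions for illustration, not the paper's actual training setup.

```python
# Minimal sketch of the ViDiDi idea as summarized above: form first- and
# second-order temporal derivatives of a clip by finite differencing, encode
# the clip and its derivatives with one shared network, and encourage the
# embeddings to agree. The tiny 3D-CNN encoder and the cosine consistency
# objective are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedClipEncoder(nn.Module):
    """One network encodes a clip (B, C, T, H, W) into a single embedding."""

    def __init__(self, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.proj = nn.Linear(16, emb_dim)

    def forward(self, clip):
        return self.proj(self.backbone(clip))


def temporal_derivative(clip, order=1):
    """Finite-difference derivative of the frame sequence along time (dim=2)."""
    for _ in range(order):
        clip = clip[:, :, 1:] - clip[:, :, :-1]
    return clip


encoder = SharedClipEncoder()
clip = torch.randn(2, 3, 16, 32, 32)  # (batch, channels, frames, height, width)
views = [clip, temporal_derivative(clip, 1), temporal_derivative(clip, 2)]
embs = [F.normalize(encoder(v), dim=-1) for v in views]

# Illustrative consistency objective: pull the embeddings of the clip and of
# its temporal derivatives toward each other via cosine similarity.
loss = sum(1.0 - (embs[0] * e).sum(dim=-1).mean() for e in embs[1:])
print(loss.item())
```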
This list is automatically generated from the titles and abstracts of the papers on this site.