Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation
- URL: http://arxiv.org/abs/2405.16393v2
- Date: Tue, 28 May 2024 05:25:00 GMT
- Title: Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation
- Authors: Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui,
- Abstract summary: We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations.
We train on real-world videos enhanced with this innovative motion depiction approach.
To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
- Score: 15.569467643817447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.
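The clip-by-clip strategy described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `sample_clip` is a hypothetical stand-in for the diffusion sampler, frames are flat lists of floats, and the "feature representation" of the reference image is simply the image itself. It only shows the control flow (link the last frame of a clip with fresh noise, inject the reference features at every step) and is not the authors' implementation.

```python
import random


def sample_clip(cond_frame, ref_features, clip_len, rng):
    """Hypothetical stand-in for a video diffusion sampler (not the paper's model)."""
    # Input noise for the frames that still have to be generated.
    noise = [[rng.gauss(0.0, 1.0) for _ in cond_frame] for _ in range(clip_len - 1)]
    # Link the conditioning frame with the input noise along the time axis.
    model_input = [list(cond_frame)] + noise
    # Toy "denoising": damp the noise and anchor each generated frame to the
    # reference features, which are supplied again for every clip.
    generated = [
        [0.1 * n + r for n, r in zip(frame, ref_features)]
        for frame in model_input[1:]
    ]
    return [model_input[0]] + generated


def generate_long_video(reference, num_clips, clip_len, seed=0):
    """Generate a long sequence clip by clip, starting from a reference frame."""
    rng = random.Random(seed)
    ref_features = list(reference)  # assumption: identity "feature extractor"
    video = [list(reference)]
    cond = video[0]
    for _ in range(num_clips):
        clip = sample_clip(cond, ref_features, clip_len, rng)
        video.extend(clip[1:])  # the conditioning frame is already stored
        cond = clip[-1]         # the final frame seeds the next clip
    return video
```

In this toy version each new clip starts from the last frame of the previous one, so consecutive clips share a boundary frame, and because the reference features enter every call, the generated frames stay centered on the reference rather than drifting from clip to clip.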
Related papers
- Replace Anyone in Videos [39.4019337319795]
We propose the ReplaceAnyone framework, which focuses on localizing and manipulating human motion in videos.
Specifically, we formulate this task as an image-conditioned pose-driven video inpainting paradigm.
We introduce diverse mask forms involving regular and irregular shapes to avoid shape leakage and allow granular local control.
arXiv Detail & Related papers (2024-09-30T03:27:33Z)
- Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics [67.97235923372035]
We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics.
At test time, given a single image and a sparse set of motion trajectories, Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions.
arXiv Detail & Related papers (2024-08-08T17:59:38Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z)
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [63.43133768897087]
We propose a method to convert open-domain images into animated videos.
The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance.
Our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image.
arXiv Detail & Related papers (2023-10-18T14:42:16Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
- Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model [57.855362366674264]
We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues.
Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.
arXiv Detail & Related papers (2023-08-15T13:00:42Z)
- LEO: Generative Latent Image Animator for Human Video Synthesis [42.925592662547814]
We propose a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency.
Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.
We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM).
arXiv Detail & Related papers (2023-05-06T09:29:12Z)
- Continuous-Time Video Generation via Learning Motion Dynamics with Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints in arbitrary frame rates, and the second stage synthesizes videos based on the given keypoints sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z)