Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with
Image Diffusion Model
- URL: http://arxiv.org/abs/2308.07749v1
- Date: Tue, 15 Aug 2023 13:00:42 GMT
- Title: Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with
Image Diffusion Model
- Authors: Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang
- Abstract summary: We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues.
Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.
- Score: 57.855362366674264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rising demand for creating lifelike avatars in the digital realm has led
to an increased need for generating high-quality human videos guided by textual
descriptions and poses. We propose Dancing Avatar, designed to fabricate human
motion videos driven by poses and textual cues. Our approach employs a
pretrained T2I diffusion model to generate each video frame in an
autoregressive fashion. The crux of innovation lies in our adept utilization of
the T2I diffusion model for producing video frames successively while
preserving contextual relevance. We surmount the hurdles posed by maintaining
human character and clothing consistency across varying poses, along with
upholding the background's continuity amidst diverse human movements. To ensure
consistent human appearances across the entire video, we devise an intra-frame
alignment module. This module assimilates text-guided synthesized human
character knowledge into the pretrained T2I diffusion model, synergizing
insights from ChatGPT. For preserving background continuity, we put forth a
background alignment pipeline, amalgamating insights from segment anything and
image inpainting techniques. Furthermore, we propose an inter-frame alignment
module that draws inspiration from an auto-regressive pipeline to augment
temporal consistency between adjacent frames, where the preceding frame guides
the synthesis process of the current frame. Comparisons with state-of-the-art
methods demonstrate that Dancing Avatar generates human videos of markedly
superior quality in terms of human fidelity, background fidelity, and temporal
coherence.
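The approach described above (a pretrained, pose-conditioned T2I diffusion model generating frames autoregressively, with the preceding frame guiding the current one) can be approximated with off-the-shelf components. The following is a minimal sketch, assuming an OpenPose ControlNet and the Hugging Face diffusers ControlNet img2img pipeline as a crude stand-in for the paper's inter-frame alignment; the intra-frame alignment module, the ChatGPT-derived character knowledge, and the Segment Anything plus inpainting background alignment pipeline are not reproduced here. Model IDs, the prompt, and the strength value are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch (not the authors' implementation): autoregressive, pose- and
# text-guided frame synthesis with a pretrained T2I diffusion model. The
# previous frame seeds each img2img step as a crude stand-in for the paper's
# inter-frame alignment module.
import torch
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetImg2ImgPipeline,
    StableDiffusionControlNetPipeline,
)

device = "cuda"
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

# Text + pose -> first frame (plain ControlNet text-to-image).
t2i = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to(device)

# Text + pose + previous frame -> every subsequent frame (ControlNet img2img).
img2img = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to(device)

def generate_video(prompt, pose_maps, seed=0, strength=0.6):
    """pose_maps: list of PIL images rendering the OpenPose skeleton per frame."""
    gen = torch.Generator(device).manual_seed(seed)
    frames = [t2i(prompt, image=pose_maps[0], generator=gen).images[0]]
    for pose in pose_maps[1:]:
        frame = img2img(
            prompt,
            image=frames[-1],    # preceding frame guides the current synthesis
            control_image=pose,  # target pose for the current frame
            strength=strength,   # how far the result may drift from the seed frame
            generator=gen,
        ).images[0]
        # The paper additionally re-composites a consistent background here
        # (Segment Anything segmentation + image inpainting); omitted in this sketch.
        frames.append(frame)
    return frames
```

Naive chaining like this tends to drift in character identity and background over long sequences, which is the failure mode the paper's intra-frame alignment module and background alignment pipeline are designed to address.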
Related papers
- Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation [15.569467643817447]
We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations.
We train on real-world videos enhanced with this innovative motion depiction approach.
To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
arXiv Detail & Related papers (2024-05-26T00:53:26Z) - LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [44.18315132571804]
We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models.
Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.
arXiv Detail & Related papers (2023-12-12T10:07:37Z) - Towards 4D Human Video Stylization [56.33756124829298]
We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis and human animation.
We leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space.
Our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.
arXiv Detail & Related papers (2023-12-07T08:58:33Z) - Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z) - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [63.43133768897087]
We propose a method to convert open-domain images into animated videos.
The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance.
Our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image.
arXiv Detail & Related papers (2023-10-18T14:42:16Z) - Text2Performer: Text-Driven Human Video Generation [97.3849869893433]
Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity.
Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer.
In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts.
arXiv Detail & Related papers (2023-04-17T17:59:02Z) - DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion [63.179505586264014]
We present DreamPose, a diffusion-based method for generating animated fashion videos from still images.
Given an image and a sequence of human body poses, our method synthesizes a video containing both human and fabric motion.
arXiv Detail & Related papers (2023-04-12T17:59:17Z) - Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person holds great potential in computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)