Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis
- URL: http://arxiv.org/abs/2110.14147v2
- Date: Thu, 28 Oct 2021 03:08:58 GMT
- Title: Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis
- Authors: Bowen Wu, Zhenyu Xie, Xiaodan Liang, Yubei Xiao, Haoye Dong, Liang Lin
- Abstract summary: Transfering human motion from a source to a target person poses great potential in computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
- Score: 124.48519390371636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferring human motion from a source to a target person holds great
potential in computer vision and graphics applications. A crucial step is to
manipulate sequential future motion while retaining the appearance
characteristics of the target. Previous work has either relied on crafted 3D
human models or trained a separate model specifically for each target person,
which is not scalable in practice. This work studies a more general setting, in
which we aim to learn a single model, the Collaborative Parsing-Flow Network
(CPF-Net), that parsimoniously transfers motion from a source video to any
target person given only one image of that person. The paucity of information
about the target person makes it particularly challenging to faithfully
preserve the appearance across varying designated poses. To address this issue,
CPF-Net integrates structured human parsing and appearance flow to guide
realistic foreground synthesis, which is then merged into the background by a
spatio-temporal fusion module. In particular, CPF-Net decouples the problem
into stages of human parsing sequence generation, foreground sequence
generation, and final video generation. The human parsing generation stage
captures both the pose and the body structure of the target. The appearance
flow helps preserve fine details in the synthesized frames. The integration of
human parsing and appearance flow effectively guides the generation of video
frames with realistic appearance. Finally, the dedicatedly designed fusion
network ensures temporal coherence. We further collect a large set of human
dancing videos to push forward this research field. Both quantitative and
qualitative results show that our method substantially improves over previous
approaches and is able to generate appealing, photo-realistic target videos
given any input person image. All source code and the dataset will be released
at https://github.com/xiezhy6/CPF-Net.
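The staged design described in the abstract (parsing sequence generation, flow-guided foreground synthesis, and spatio-temporal fusion with the background) can be pictured as a simple forward pipeline. The sketch below is a minimal, hypothetical PyTorch-style illustration of that data flow for a single frame, not the authors' released code; all module names, layer choices, and channel counts (ParsingGenerator, FlowGuidedForeground, SpatioTemporalFusion, 18 pose channels, 20 parsing channels) are assumptions made purely for illustration.

```python
# Hypothetical sketch of the three-stage CPF-Net data flow described in the
# abstract. Module names, shapes, and layer choices are illustrative only.
import torch
import torch.nn as nn


class ParsingGenerator(nn.Module):
    """Stage 1: predict a human-parsing map for a target pose."""
    def __init__(self, pose_ch=18, parse_ch=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pose_ch + parse_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, parse_ch, 3, padding=1),
        )

    def forward(self, pose_map, source_parsing):
        return self.net(torch.cat([pose_map, source_parsing], dim=1))


class FlowGuidedForeground(nn.Module):
    """Stage 2: warp the source appearance with a predicted flow and refine it,
    conditioned on the generated parsing map."""
    def __init__(self, parse_ch=20):
        super().__init__()
        self.flow_net = nn.Sequential(
            nn.Conv2d(3 + parse_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),  # dense 2-D appearance flow
        )
        self.refine = nn.Conv2d(3 + parse_ch, 3, 3, padding=1)

    def forward(self, source_img, target_parsing):
        b, _, h, w = source_img.shape
        flow = self.flow_net(torch.cat([source_img, target_parsing], dim=1))
        # Build a sampling grid from the predicted flow and warp the source image.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)
        grid = base + flow.permute(0, 2, 3, 1)
        warped = nn.functional.grid_sample(source_img, grid, align_corners=True)
        return self.refine(torch.cat([warped, target_parsing], dim=1))


class SpatioTemporalFusion(nn.Module):
    """Stage 3: blend the synthesized foreground into the background frame."""
    def __init__(self):
        super().__init__()
        self.mask_net = nn.Conv2d(3, 1, 3, padding=1)

    def forward(self, foreground, background):
        mask = torch.sigmoid(self.mask_net(foreground))
        return mask * foreground + (1 - mask) * background


# One frame of the pipeline: pose -> parsing -> foreground -> fused frame.
if __name__ == "__main__":
    pose = torch.randn(1, 18, 64, 64)
    src_parse = torch.randn(1, 20, 64, 64)
    src_img, bg = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    parsing = ParsingGenerator()(pose, src_parse)
    fg = FlowGuidedForeground()(src_img, parsing)
    frame = SpatioTemporalFusion()(fg, bg)
    print(frame.shape)  # torch.Size([1, 3, 64, 64])
```

In the actual method this loop runs over a pose sequence and the fusion stage also enforces temporal coherence across frames; the sketch only shows how parsing and appearance flow jointly condition the foreground synthesis for one frame.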
Related papers
- Do As I Do: Pose Guided Human Motion Copy [39.40271266234068]
Motion copy is an intriguing yet challenging task in artificial intelligence and computer vision.
Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video.
We present an episodic memory module in the pose-to-appearance generation to propel continuous learning.
Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID respectively.
arXiv Detail & Related papers (2024-06-24T12:41:51Z)
- VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis [40.869862603815875]
VLOGGER is a method for audio-driven human video generation from a single input image.
We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls.
We show applications in video editing and personalization.
arXiv Detail & Related papers (2024-03-13T17:59:02Z)
- Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons [73.21855272778616]
We introduce a new task, dataset, and evaluation protocol for compositional human dance generation (cHDG).
We propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and background while precisely following the driving poses.
arXiv Detail & Related papers (2024-01-24T10:44:16Z)
- Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model [57.855362366674264]
We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues.
Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.
arXiv Detail & Related papers (2023-08-15T13:00:42Z)
- Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
- Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z)
- Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis [58.05389586712485]
We tackle human image synthesis, including human motion imitation, appearance transfer, and novel view synthesis.
In this paper, we propose a 3D body mesh recovery module to disentangle the pose and shape.
We also build a new dataset, namely iPER dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis.
arXiv Detail & Related papers (2020-11-18T02:57:47Z)
- Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN).
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
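The Speech2Video entry above describes a pipeline whose first step maps an audio sequence to 3D skeleton movements with a recurrent network. Below is a minimal, hypothetical sketch of that audio-to-skeleton step only; the AudioToSkeleton name, the LSTM choice, and all feature dimensions are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an audio-to-3D-skeleton regressor, loosely following
# the Speech2Video summary above (an RNN maps audio features to joint positions).
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class AudioToSkeleton(nn.Module):
    def __init__(self, audio_dim=80, hidden_dim=256, num_joints=21):
        super().__init__()
        # Recurrent backbone over per-frame audio features (e.g. a mel spectrogram).
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        # Per-frame head regressing (x, y, z) for each skeleton joint.
        self.head = nn.Linear(hidden_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim)
        hidden, _ = self.rnn(audio_feats)
        joints = self.head(hidden)              # (batch, time, num_joints * 3)
        return joints.view(*joints.shape[:2], self.num_joints, 3)


if __name__ == "__main__":
    model = AudioToSkeleton()
    mel = torch.randn(2, 100, 80)               # 2 clips, 100 audio frames each
    skeleton = model(mel)
    print(skeleton.shape)                       # torch.Size([2, 100, 21, 3])
```

The paper's full system additionally regularizes these skeleton movements with an articulated 3D human skeleton and a learned dictionary of personal speech gestures before rendering the photo-realistic video; those later stages are not sketched here.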