Everybody's Talkin': Let Me Talk as You Want
- URL: http://arxiv.org/abs/2001.05201v1
- Date: Wed, 15 Jan 2020 09:54:23 GMT
- Title: Everybody's Talkin': Let Me Talk as You Want
- Authors: Linsen Song, Wayne Wu, Chen Qian, Ran He, Chen Change Loy
- Abstract summary: We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
- Score: 134.65914135774605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method to edit target portrait footage by taking a sequence of
audio as input to synthesize a photo-realistic video. This method is unique
because it is highly dynamic: it does not assume a person-specific rendering
network, yet it is capable of translating arbitrary source audio into arbitrary video
output. Instead of learning a highly heterogeneous and nonlinear mapping from
audio to video directly, we first factorize each target video frame into
orthogonal parameter spaces, i.e., expression, geometry, and pose, via
monocular 3D face reconstruction. Next, a recurrent network is introduced to
translate source audio into expression parameters that are primarily related to
the audio content. The audio-translated expression parameters are then used to
synthesize a photo-realistic human subject in each video frame, with the
movement of the mouth regions precisely mapped to the source audio. The
geometry and pose parameters of the target human portrait are retained,
thereby preserving the context of the original video footage. Finally, we
introduce a novel video rendering network and a dynamic programming method to
construct a temporally coherent and photo-realistic video. Extensive
experiments demonstrate the superiority of our method over existing approaches.
Our method is end-to-end learnable and robust to voice variations in the source
audio.
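The described pipeline (factorize each frame into expression, geometry, and pose via monocular 3D face reconstruction; translate audio into expression parameters with a recurrent network; then render) can be illustrated with a minimal sketch. The module below is a hypothetical PyTorch example, not the authors' code; the audio feature dimension, hidden size, and number of 3DMM expression coefficients are assumptions for illustration only.
```python
# Illustrative sketch (not the paper's implementation): a recurrent
# audio-to-expression mapper. Assumed dimensions: 28-dim per-frame audio
# features (e.g., MFCCs) and 64 3DMM expression coefficients.
import torch
import torch.nn as nn


class AudioToExpression(nn.Module):
    """Maps a sequence of per-frame audio features to expression parameters."""

    def __init__(self, audio_dim=28, hidden_dim=256, expr_dim=64, num_layers=2):
        super().__init__()
        # Recurrent backbone: captures the temporal context of the source audio.
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers, batch_first=True)
        # Per-frame regressor from the hidden state to expression coefficients.
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim)
        hidden, _ = self.rnn(audio_feats)
        return self.head(hidden)  # (batch, frames, expr_dim)


# Usage: the predicted expression parameters would replace those of the target
# frames, while the target's geometry and pose parameters (from monocular 3D
# face reconstruction) are kept, before a rendering network produces the final
# photo-realistic frames.
model = AudioToExpression()
audio = torch.randn(1, 100, 28)   # 100 frames of audio features
expr = model(audio)               # (1, 100, 64) expression parameters
```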
Related papers
- ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - VideoReTalking: Audio-based Lip Synchronization for Talking Head Video
Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality, lip-synced output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z) - Audio-driven Neural Gesture Reenactment with Video Motion Graphs [30.449816206864632]
We present a method that reenacts a high-quality video with gestures matching a target speech audio.
The key idea of our method is to split and re-assemble clips from a reference video through a novel video motion graph encoding valid transitions between clips.
To seamlessly connect different clips in the reenactment, we propose a pose-aware video blending network which synthesizes video frames around the stitched frames between two clips.
arXiv Detail & Related papers (2022-07-23T14:02:57Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z) - Facial Keypoint Sequence Generation from Audio [2.66512000865131]
This is the first work to propose an audio-keypoint dataset and learn a model that outputs a plausible keypoint sequence to accompany audio of any arbitrary length.
arXiv Detail & Related papers (2020-11-02T16:47:52Z)