Facial Keypoint Sequence Generation from Audio
- URL: http://arxiv.org/abs/2011.01114v1
- Date: Mon, 2 Nov 2020 16:47:52 GMT
- Title: Facial Keypoint Sequence Generation from Audio
- Authors: Prateek Manocha and Prithwijit Guha
- Abstract summary: This is the first work to propose an audio-keypoint dataset and to learn a model that outputs a plausible keypoint sequence for audio of arbitrary length.
- Score: 2.66512000865131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whenever we speak, our voice is accompanied by facial movements and
expressions. Several recent works have shown the synthesis of highly
photo-realistic videos of talking faces, but they either require a source video
to drive the target face or only generate videos with a fixed head pose. This
lack of facial movement arises because most of these works focus on lip movement in sync with the audio while assuming that the remaining facial keypoints stay fixed. To address this, a unique audio-keypoint dataset of over 150,000 videos at 224p and 25 fps is introduced that relates facial keypoint movement to the given audio. This dataset is then used to train the
model, Audio2Keypoint, a novel approach for synthesizing facial keypoint
movement to go with the audio. Given a single image of the target person and an
audio sequence (in any language), Audio2Keypoint generates a plausible keypoint
movement sequence in sync with the input audio, conditioned on the input image
to preserve the target person's facial characteristics. To the best of our
knowledge, this is the first work that proposes an audio-keypoint dataset and learns a model to output a plausible keypoint sequence for audio of arbitrary length. Audio2Keypoint generalizes to unseen people with different facial structures, allowing the sequence to be generated with a voice from any source, including synthetic voices. Instead of learning a direct mapping
from audio to video domain, this work aims to learn the audio-keypoint mapping
that allows for in-plane and out-of-plane head rotations, while preserving the
person's identity using a Pose Invariant (PIV) Encoder.
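To make the described interface concrete, below is a minimal, hypothetical sketch (not the authors' released code) of an Audio2Keypoint-style model: a pose-invariant identity encoder conditions an audio-to-keypoint decoder, so a single reference image plus an audio feature sequence of arbitrary length yields a keypoint sequence of the same length. All module layouts, dimensions, and the 68-keypoint assumption are illustrative.

```python
# Hypothetical sketch of an Audio2Keypoint-style inference interface.
# Module names, sizes, and the 68-keypoint layout are assumptions, not the paper's code.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 68      # assumed facial landmark count
AUDIO_FEAT_DIM = 80     # assumed per-frame audio feature size (e.g. mel bins)
IDENTITY_DIM = 128      # assumed identity embedding size

class PoseInvariantEncoder(nn.Module):
    """Encodes a single face image into an identity embedding intended to be
    invariant to in-plane / out-of-plane head rotation (illustrative layout)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, IDENTITY_DIM),
        )

    def forward(self, image):              # image: (B, 3, H, W)
        return self.backbone(image)        # -> (B, IDENTITY_DIM)

class Audio2KeypointDecoder(nn.Module):
    """Maps a per-frame audio feature sequence, conditioned on the identity
    embedding, to one (x, y) keypoint set per frame."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(AUDIO_FEAT_DIM + IDENTITY_DIM, 256, batch_first=True)
        self.head = nn.Linear(256, NUM_KEYPOINTS * 2)

    def forward(self, audio_feats, identity):   # audio_feats: (B, T, AUDIO_FEAT_DIM)
        T = audio_feats.size(1)
        cond = identity.unsqueeze(1).expand(-1, T, -1)   # broadcast identity over time
        h, _ = self.rnn(torch.cat([audio_feats, cond], dim=-1))
        return self.head(h).view(-1, T, NUM_KEYPOINTS, 2)

# Usage: one reference image + arbitrary-length audio features -> keypoint sequence.
encoder, decoder = PoseInvariantEncoder(), Audio2KeypointDecoder()
image = torch.randn(1, 3, 224, 224)           # single image of the target person
audio = torch.randn(1, 250, AUDIO_FEAT_DIM)   # e.g. 10 s of audio features at 25 fps
keypoints = decoder(audio, encoder(image))    # (1, 250, 68, 2)
```

Because the decoder consumes per-frame audio features sequentially, the same interface handles audio of any length; the identity embedding is what lets the keypoint sequence stay faithful to the target person's facial structure.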
Related papers
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z) - Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
- A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis [14.303621416852602]
A Keypoint Based Enhancement (KPBE) method is proposed for audio-driven free-view talking head synthesis.
Experiments show that our proposed enhancement method improved the quality of talking-head videos in terms of mean opinion score.
arXiv Detail & Related papers (2022-10-07T05:44:10Z)
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-08-23T12:49:01Z)
- One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [20.51814865676907]
It would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements.
We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker.
Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
arXiv Detail & Related papers (2021-12-06T02:53:51Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)