StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
- URL: http://arxiv.org/abs/2208.10922v2
- Date: Fri, 15 Mar 2024 08:48:04 GMT
- Title: StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
- Authors: Dongchan Min, Minyoung Song, Eunji Ko, Sung Ju Hwang,
- Abstract summary: StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
- Score: 47.06075725469252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflects the given audio. This is made possible with several newly devised components: 1) A contrastive lip-sync discriminator for accurate lip synchronization, 2) A conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity. 3) An auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality which are accurately lip-synced with the input audios, largely outperforming state-of-the-art baselines.
Related papers
- PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation [17.158581488104186]
Previous audio-driven talking head generation (THG) methods generate head poses from driving audio.
We propose textbfPoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio.
arXiv Detail & Related papers (2024-09-04T12:30:25Z) - Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to the unique speaking styles of individuals.
We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from style reference video.
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z) - ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - VideoReTalking: Audio-based Lip Synchronization for Talking Head Video
Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality and lip-syncing output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z) - One-shot Talking Face Generation from Single-speaker Audio-Visual
Correlation Learning [20.51814865676907]
It would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements.
We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker.
Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
arXiv Detail & Related papers (2021-12-06T02:53:51Z) - FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute
Learning [23.14865405847467]
We propose a talking face generation method that takes an audio signal as input and a short target video clip as reference.
The method synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal.
Experimental results and user studies show our method can generate realistic talking face videos with better qualities than the results of state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T02:10:26Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.