Emotional Talking Head Generation based on Memory-Sharing and
Attention-Augmented Networks
- URL: http://arxiv.org/abs/2306.03594v1
- Date: Tue, 6 Jun 2023 11:31:29 GMT
- Title: Emotional Talking Head Generation based on Memory-Sharing and
Attention-Augmented Networks
- Authors: Jianrong Wang, Yaxin Zhao, Li Liu, Tianyi Xu, Qi Li, Sen Li
- Abstract summary: We propose a talking head generation model consisting of a Memory-Sharing Emotion Feature extractor and an Attention-Augmented Translator based on U-net.
MSEF can extract implicit emotional auxiliary features from audio to estimate more accurate emotional face landmarks.
AATU acts as a translator between the estimated landmarks and the photo-realistic video frames.
- Score: 21.864200803678003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an audio clip and a reference face image, the goal of talking head
generation is to generate a high-fidelity talking head video. Although
audio-driven methods for generating talking head videos have made
progress in the past, most of them focus only on lip and audio
synchronization and lack the ability to reproduce the facial expressions of the
target person. To this end, we propose a talking head generation model
consisting of a Memory-Sharing Emotion Feature extractor (MSEF) and an
Attention-Augmented Translator based on U-net (AATU). Firstly, MSEF extracts
implicit emotional auxiliary features from audio to estimate more accurate
emotional face landmarks. Secondly, AATU acts as a translator between the
estimated landmarks and the photo-realistic video frames. Extensive qualitative
and quantitative experiments show the superiority of the proposed method
over previous works. Code will be made publicly available.
Related papers
- Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation [13.135789543388801]
We propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation.
We introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces.
We then design a face editing module that modifies speech content and facial latent codes into a single latent space.
arXiv Detail & Related papers (2024-05-12T11:41:44Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
State of the art in talking face generation focuses mainly on lip-syncing, being conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding [32.006763134518245]
This paper proposes a talking face generation method named "CP-EB".
It takes an audio signal as input and a person image as reference, and synthesizes a photo-realistic talking video of that person, with head poses controlled by a short video clip and proper eye blinking.
Experimental results show that the proposed method can generate photo-realistic talking faces with synchronized lip motions, natural head poses and blinking eyes.
arXiv Detail & Related papers (2023-11-15T03:37:41Z) - MFR-Net: Multi-faceted Responsive Listening Head Generation via
Denoising Diffusion Model [14.220727407255966]
Responsive listening head generation is an important task that aims to model face-to-face communication scenarios.
We propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net).
arXiv Detail & Related papers (2023-08-31T11:10:28Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to the multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities into this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
- Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio.
Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces.
With the disentangled features, dynamic 2D emotional facial landmarks can be deduced.
Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinctive head movements than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.