Talking Head Generation with Audio and Speech Related Facial Action Units
- URL: http://arxiv.org/abs/2110.09951v1
- Date: Tue, 19 Oct 2021 13:14:27 GMT
- Title: Talking Head Generation with Audio and Speech Related Facial Action Units
- Authors: Sen Chen, Zhilei Liu, Jiaxing Liu, Zhengxiang Yan, Longbiao Wang
- Abstract summary: The task of talking head generation is to synthesize a lip-synchronized talking head video from an arbitrary face image and audio clips.
We propose a novel recurrent generative network that uses both audio and speech-related facial action units (AUs) as the driving information.
- Score: 23.12239373576773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of talking head generation is to synthesize a
lip-synchronized talking head video from an arbitrary face image and audio
clips. Most
existing methods ignore the local driving information of the mouth muscles. In
this paper, we propose a novel recurrent generative network that uses both
audio and speech-related facial action units (AUs) as the driving information.
AU information related to the mouth can guide the movement of the mouth more
accurately. Since speech is highly correlated with speech-related AUs, we
propose an Audio-to-AU module in our system to predict the speech-related AU
information from speech. In addition, we use an AU classifier to ensure that
the generated images contain the correct AU information. A frame discriminator
is also constructed for adversarial training to improve the realism of the
generated face. We verify the effectiveness of our model on the GRID and
TCD-TIMIT datasets. We also conduct an ablation study to verify the contribution
of each component in our model. Quantitative and qualitative experiments
demonstrate that our method outperforms existing methods in both image quality
and lip-sync accuracy.
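To make the components above more concrete, the following is a minimal sketch, assuming a PyTorch-style setup, of how an Audio-to-AU predictor and a combined generator objective (reconstruction, AU-classifier, and frame-discriminator terms) could fit together. All module names, tensor dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an Audio-to-AU predictor plus a
# combined generator loss with reconstruction, AU-classifier, and adversarial
# terms. Dimensions and loss weights below are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioToAU(nn.Module):
    """Predicts speech-related AU activations from a window of audio features."""

    def __init__(self, audio_dim: int = 128, num_aus: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_aus),
            nn.Sigmoid(),  # AU activations in [0, 1]
        )

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        return self.net(audio_feat)


def generator_loss(gen_frame, real_frame, au_pred, au_target, d_fake_logits,
                   w_rec=1.0, w_au=0.5, w_adv=0.01):
    """Combines reconstruction, AU-classifier, and adversarial terms.

    The weighting and exact formulation in the paper may differ.
    """
    rec = F.l1_loss(gen_frame, real_frame)           # pixel reconstruction
    au = F.binary_cross_entropy(au_pred, au_target)  # correct AUs in the output
    adv = F.binary_cross_entropy_with_logits(        # fool the frame discriminator
        d_fake_logits, torch.ones_like(d_fake_logits))
    return w_rec * rec + w_au * au + w_adv * adv
```

In a full pipeline, the Audio-to-AU output would condition the recurrent generator at each frame, while the AU classifier and frame discriminator would be trained alongside the generator to supply au_pred and d_fake_logits.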
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which embed prior knowledge, to generalize to unseen data in the visual domain.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on visual speech units.
We set a new state of the art in multilingual VSR, achieving performance comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
- Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion [30.549120935873407]
Talking head generation aims to synthesize a lip-synchronized talking head video from an arbitrary face image and corresponding audio clips.
Existing methods ignore not only the interaction and relationship of cross-modal information, but also the local driving information of the mouth muscles.
We propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network.
arXiv Detail & Related papers (2022-04-27T08:05:24Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [20.51814865676907]
It would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements.
We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker.
Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
arXiv Detail & Related papers (2021-12-06T02:53:51Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.