Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
- URL: http://arxiv.org/abs/2204.12756v1
- Date: Wed, 27 Apr 2022 08:05:24 GMT
- Title: Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
- Authors: Sen Chen and Zhilei Liu and Jiaxing Liu and Longbiao Wang
- Abstract summary: Talking head generation aims to synthesize a lip-synchronized talking head video from an arbitrary face image and the corresponding audio clips.
Existing methods ignore not only the interaction and relationship between cross-modal information but also the local driving information of the mouth muscles.
We propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network.
- Score: 30.549120935873407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking head generation aims to synthesize a lip-synchronized talking head
video from an arbitrary face image and the corresponding audio clips.
Existing methods ignore not only the interaction and relationship between
cross-modal information but also the local driving information of the mouth
muscles. In this study, we propose a novel generative framework that contains a
dilated non-causal temporal convolutional self-attention network as a
multimodal fusion module to promote relationship learning across cross-modal
features. In addition, our proposed method uses both audio and speech-related
facial action units (AUs) as driving information. Speech-related AU information
can guide mouth movements more accurately. Because speech is highly correlated
with speech-related AUs, we propose an audio-to-AU module to predict
speech-related AU information. We utilize a pre-trained AU classifier to ensure
that the generated images contain correct AU information. We verify the
effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An
ablation study is also conducted to verify the contribution of each component.
The results of quantitative and qualitative experiments demonstrate that our
method outperforms existing methods in terms of both image quality and lip-sync
accuracy.
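As a rough illustration of the two components named in the abstract, the sketch below pairs an audio-to-AU predictor with a dilated non-causal temporal-convolutional self-attention fusion block in PyTorch. It is a minimal sketch under assumed shapes: the feature dimensions, the number of speech-related AUs, and all module and variable names are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch: (1) audio-to-AU prediction, (2) dilated non-causal
# temporal-convolutional self-attention fusion of the two feature streams.
import torch
import torch.nn as nn


class AudioToAU(nn.Module):
    """Predict per-frame speech-related AU activations from audio features."""

    def __init__(self, audio_dim=80, hidden_dim=128, num_aus=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_aus),
            nn.Sigmoid(),  # AU activations in [0, 1]
        )

    def forward(self, audio_feats):           # (B, T, audio_dim)
        return self.net(audio_feats)          # (B, T, num_aus)


class DilatedNonCausalSelfAttentionFusion(nn.Module):
    """Fuse audio and AU streams with dilated non-causal temporal
    convolutions followed by multi-head self-attention over time."""

    def __init__(self, audio_dim=80, au_dim=8, model_dim=128,
                 num_heads=4, dilations=(1, 2, 4)):
        super().__init__()
        self.proj = nn.Linear(audio_dim + au_dim, model_dim)
        # Non-causal: symmetric padding lets each frame see past and future.
        self.tcn = nn.ModuleList([
            nn.Conv1d(model_dim, model_dim, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.attn = nn.MultiheadAttention(model_dim, num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(model_dim)

    def forward(self, audio_feats, au_feats):
        # Concatenate the two modalities per frame and project.
        x = self.proj(torch.cat([audio_feats, au_feats], dim=-1))  # (B, T, D)
        h = x.transpose(1, 2)                                      # (B, D, T)
        for conv in self.tcn:
            h = torch.relu(conv(h)) + h                            # residual
        h = h.transpose(1, 2)                                      # (B, T, D)
        attn_out, _ = self.attn(h, h, h)
        return self.norm(h + attn_out)                             # fused features


if __name__ == "__main__":
    B, T = 2, 25                      # e.g. 25 video frames (assumed)
    audio = torch.randn(B, T, 80)     # per-frame audio features (assumed 80-dim)
    au_pred = AudioToAU()(audio)      # (B, T, 8) predicted AU activations
    fused = DilatedNonCausalSelfAttentionFusion()(audio, au_pred)
    print(fused.shape)                # torch.Size([2, 25, 128])
```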
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging when extracting the target speaker without prior information under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- Talking Head Generation with Audio and Speech Related Facial Action Units [23.12239373576773]
The task of talking head generation is to synthesize a lip-synchronized talking head video from an arbitrary face image and audio clips.
We propose a novel recurrent generative network that uses both audio and speech-related facial action units (AUs) as the driving information.
arXiv Detail & Related papers (2021-10-19T13:14:27Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.