Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild
- URL: http://arxiv.org/abs/2203.03984v1
- Date: Tue, 8 Mar 2022 10:18:25 GMT
- Title: Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild
- Authors: Ganglai Wang, Peng Zhang, Lei Xie, Wei Huang and Yufei Zha
- Abstract summary: An AttnWav2Lip model is proposed that incorporates a spatial attention module and a channel attention module into the lip-syncing strategy.
To the best of our knowledge, this is the first attempt to introduce an attention mechanism into talking face generation.
- Score: 17.471128300990244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking face generation, which has great practical significance, has
attracted increasing attention in recent audio-visual studies. How to achieve
accurate lip synchronization is a long-standing challenge that remains to be
investigated. In this paper, an AttnWav2Lip model is proposed by incorporating
a spatial attention module and a channel attention module into the lip-syncing
strategy. Rather than spending capacity on unimportant regions of the face
image, the proposed AttnWav2Lip model is able to pay more attention to
reconstructing the lip region. To the best of our knowledge, this is the first
attempt to introduce an attention mechanism into talking face generation.
Extensive experiments have been conducted to evaluate the effectiveness of the
proposed model. Measured by the LSE-D and LSE-C metrics, the proposed model
outperforms the baseline on benchmark lip synthesis datasets, including LRW,
LRS2 and LRS3.
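The digest above does not spell out how the two attention modules are wired into the generator, so here is a minimal PyTorch sketch, assuming CBAM-style channel and spatial attention applied to decoder feature maps; every module name below is illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Re-weights feature channels using pooled global descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                   # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))  # average-pooled descriptor, (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))   # max-pooled descriptor, (B, C)
        w = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * w

class SpatialAttention(nn.Module):
    """Highlights informative spatial locations (e.g. the lip region)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                   # x: (B, C, H, W)
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return x * torch.sigmoid(self.conv(s))

class AttnBlock(nn.Module):
    """Channel attention followed by spatial attention, in CBAM order."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```

The intended effect, as the abstract describes, is to concentrate reconstruction capacity on the lip region rather than on static parts of the face.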
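For context on the evaluation, LSE-D (lip-sync error distance) and LSE-C (lip-sync error confidence) are computed with a pre-trained sync network, following the Wav2Lip evaluation protocol. The sketch below shows the idea on hypothetical per-window embeddings; the real metrics use SyncNet embeddings and the released evaluation code, whose aggregation details may differ.

```python
import torch
import torch.nn.functional as F

def sync_scores(video_emb, audio_emb, max_offset=15):
    """video_emb, audio_emb: (T, D) per-window embeddings from a pre-trained
    sync network (hypothetical stand-ins here). Assumes T > max_offset.
    For each temporal offset we average the video-to-audio embedding
    distance; a well-synced clip has a sharp distance minimum near offset 0."""
    T = video_emb.shape[0]
    dists = []
    for off in range(-max_offset, max_offset + 1):
        lo, hi = max(0, -off), min(T, T - off)  # indices valid for both streams
        d = F.pairwise_distance(video_emb[lo:hi], audio_emb[lo + off:hi + off])
        dists.append(d.mean())
    dists = torch.stack(dists)            # one mean distance per offset
    lse_d = dists.min()                   # distance at the best offset (lower is better)
    lse_c = dists.median() - dists.min()  # depth of the valley (higher is better)
    return lse_d.item(), lse_c.item()
```

The AV-HuBERT-based lip synchronization loss in the related papers below works on the same principle, except that the audio-visual distance is backpropagated as a training loss rather than reported as a metric.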
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [63.77823518278202]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26% WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech (a sketch of this objective follows this list).
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [98.52189797347354]
We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information across all time steps of the sequence using self-attention.
Our proposed model achieves 86.83% accuracy, a 1.53% absolute improvement over the current state-of-the-art.
arXiv Detail & Related papers (2020-12-28T16:55:51Z)
- Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
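As referenced in the LiRA entry above, the model regresses acoustic features from unlabelled lip video. The following is a minimal sketch of that objective under stated assumptions: a toy 3D-conv front-end stands in for LiRA's ResNet-18 trunk, torchaudio's Conformer stands in for its Conformer blocks, and log-mel features stand in for the PASE+ targets used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

class VisualFrontend(nn.Module):
    """Toy front-end: 3D-conv stem, spatial pooling, linear projection.
    (LiRA uses a 3D-conv + ResNet-18 trunk; this stub only keeps the shapes.)"""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.stem = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep time, squeeze space
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video):                        # video: (B, 1, T, H, W)
        x = F.relu(self.stem(video))                 # (B, 64, T, H', W')
        x = self.pool(x).flatten(2).transpose(1, 2)  # (B, T, 64)
        return self.proj(x)                          # (B, T, feat_dim)

class LiRAStyleModel(nn.Module):
    def __init__(self, feat_dim=256, target_dim=80):  # 80-dim log-mels as stand-in targets
        super().__init__()
        self.frontend = VisualFrontend(feat_dim)
        self.encoder = torchaudio.models.Conformer(
            input_dim=feat_dim, num_heads=4, ffn_dim=512,
            num_layers=4, depthwise_conv_kernel_size=31)
        self.head = nn.Linear(feat_dim, target_dim)

    def forward(self, video, lengths):               # lengths: (B,) valid frame counts
        x = self.frontend(video)
        x, _ = self.encoder(x, lengths)
        return self.head(x)                          # predicted features, (B, T, target_dim)

# Self-supervised step: targets come from the clip's own audio track,
# so no transcriptions are required.
# loss = F.l1_loss(model(video, lengths), log_mel_targets)
```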
This list is automatically generated from the titles and abstracts of the papers on this site.