Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild
- URL: http://arxiv.org/abs/2203.03984v1
- Date: Tue, 8 Mar 2022 10:18:25 GMT
- Title: Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild
- Authors: Ganglai Wang, Peng Zhang, Lei Xie, Wei Huang and Yufei Zha
- Abstract summary: Motivated by xxx, in this paper, an AttnWav2Lip model is proposed by incorporating a spatial attention module and a channel attention module into the lip-syncing strategy.
To the best of our knowledge, this is the first attempt to introduce an attention mechanism into talking face generation.
- Score: 17.471128300990244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking face generation, which is of great practical significance, has attracted increasing attention in recent audio-visual studies. How to achieve accurate lip synchronization remains a long-standing challenge that calls for further investigation. Motivated by xxx, this paper proposes an AttnWav2Lip model that incorporates a spatial attention module and a channel attention module into the lip-syncing strategy. Rather than focusing on unimportant regions of the face image, the proposed AttnWav2Lip model pays more attention to reconstructing the lip region. To the best of our knowledge, this is the first attempt to introduce an attention mechanism into the talking face generation scheme. Extensive experiments have been conducted to evaluate the effectiveness of the proposed model. Measured by the LSE-D and LSE-C metrics, superior performance over the baseline is demonstrated on benchmark lip synthesis datasets, including LRW, LRS2, and LRS3.
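As a rough illustration of the idea described in the abstract, the sketch below shows CBAM-style channel- and spatial-attention blocks of the kind such a lip-syncing generator might use. The abstract does not specify the exact module design, so the class names, reduction ratio, kernel size, and placement on decoder feature maps are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed CBAM-style design); not the paper's specified modules.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights feature channels using pooled global statistics."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale


class SpatialAttention(nn.Module):
    """Highlights informative spatial locations (e.g. the lip region)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention on a feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(x))


if __name__ == "__main__":
    feats = torch.randn(2, 64, 48, 96)        # hypothetical decoder feature maps
    print(AttentionBlock(64)(feats).shape)    # torch.Size([2, 64, 48, 96])
```

In a Wav2Lip-style generator, such a block would typically be inserted after selected encoder or decoder convolutions so that reconstruction capacity concentrates on the mouth area; whether AttnWav2Lip places its modules exactly this way is not stated in the abstract.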
Related papers
- LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition [12.336693356113308]
We propose a novel framework, LipGen, to improve model robustness.
We introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms.
Our method demonstrates superior performance compared to the current state-of-the-art on the lip reading in the wild (LRW) dataset.
arXiv Detail & Related papers (2025-01-08T00:52:19Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [98.52189797347354]
We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information in all time steps of the sequence by utilizing self-attention.
Our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art.
arXiv Detail & Related papers (2020-12-28T16:55:51Z) - Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages together, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)