Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
- URL: http://arxiv.org/abs/2403.16071v2
- Date: Thu, 2 May 2024 08:53:35 GMT
- Title: Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
- Authors: Linzhi Wu, Xingyu Zhang, Yakun Zhang, Changyan Zheng, Tiejun Liu, Liang Xie, Ye Yan, Erwei Yin
- Abstract summary: We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
- Score: 4.801824063852808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading, the process of interpreting silent speech from visual lip movements, has gained rising attention for its wide range of realistic applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, remains challenging due to inter-speaker variability: a well-trained lip reading system may perform poorly when handling a brand new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variations across speakers and avoid overfitting to specific speakers. In this work, considering both the input visual clues and the latent representations of a hybrid CTC/attention architecture, we propose to exploit lip landmark-guided fine-grained visual clues instead of the frequently used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, a max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.
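The max-min mutual information idea can be illustrated with a short sketch. The PyTorch code below is a hypothetical illustration under common modeling choices, not the authors' released implementation: it estimates a CLUB-style variational upper bound on the mutual information between a latent representation `z` and a speaker embedding `s` (to be minimized, suppressing speaker identity) and an InfoNCE lower bound on the mutual information between `z` and content features `c` (to be maximized, preserving speech content). All class, function, and variable names are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClubSpeakerMI(nn.Module):
    """CLUB-style variational upper bound on I(z; s) between a latent
    representation z and a speaker embedding s. Minimizing the returned
    estimate discourages z from encoding speaker identity."""

    def __init__(self, z_dim: int, s_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, s_dim))
        self.logvar = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, s_dim))

    def log_q(self, z: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # Gaussian log-likelihood log q(s | z), up to an additive constant.
        mu, logvar = self.mu(z), self.logvar(z)
        return (-0.5 * (s - mu) ** 2 / logvar.exp() - 0.5 * logvar).sum(dim=-1)

    def forward(self, z: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        positive = self.log_q(z, s)                                # matched (z, s) pairs
        shuffled = s[torch.randperm(s.size(0), device=s.device)]   # break the pairing
        negative = self.log_q(z, shuffled)
        return (positive - negative).mean()                        # MI upper-bound estimate


def content_infonce(z: torch.Tensor, c: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE lower bound on I(z; c); maximizing it keeps z informative
    about the speech-content features c."""
    z, c = F.normalize(z, dim=-1), F.normalize(c, dim=-1)
    logits = z @ c.t() / temperature                   # (B, B) cosine similarities
    labels = torch.arange(z.size(0), device=z.device)
    return -F.cross_entropy(logits, labels)            # lower bound up to log(B)
```

A hypothetical training objective would then look like `loss = ctc_attention_loss + lambda_min * club(z, spk_emb) - lambda_max * content_infonce(z, content_feat)`, where `z`, `spk_emb`, and `content_feat` are pooled per-utterance vectors and the two weights are tuned on held-out speakers.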
Related papers
- Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language [48.17930606488952]
Lip reading aims to predict spoken language by analyzing lip movements.
Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers.
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels.
arXiv Detail & Related papers (2024-09-02T07:05:12Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well by shallow networks from a few of his/her facial images, or even a single image.
In contrast, the fine-grained dynamic features associated with the speech content expressed by the talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading (a toy viseme grouping is sketched after this list).
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, synthesizes the facial motions around the lips that are consistent with a given speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting speakers seen during training (overlapped speakers).
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z)
- Speaker-adaptive Lip Reading with User-dependent Padding [34.85015917909356]
Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation techniques aim to reduce this mismatch between training and test speakers.
arXiv Detail & Related papers (2022-08-09T01:59:30Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
Combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
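Related to the "Leveraging Visemes for Better Visual Speech Representation and Lip Reading" entry above, the following is a minimal sketch of how phonemes can be grouped into viseme classes to serve as coarser, more visually separable targets. The grouping and the helper names here are illustrative assumptions, not the mapping used in that paper.

```python
# Hypothetical phoneme-to-viseme grouping: phonemes that look alike on the
# lips share a viseme class, so a visual model predicts fewer, more visually
# separable targets. The grouping below is illustrative only.
PHONEME_TO_VISEME = {
    # bilabials look nearly identical on the lips
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    # labiodentals
    "f": "V_labiodental", "v": "V_labiodental",
    # rounded vowels / glides
    "o": "V_rounded", "u": "V_rounded", "w": "V_rounded",
    # open vowels
    "a": "V_open", "ae": "V_open",
}


def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to viseme targets, collapsing adjacent repeats."""
    visemes = [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]
    return [v for i, v in enumerate(visemes) if i == 0 or v != visemes[i - 1]]


if __name__ == "__main__":
    # "map" -> /m ae p/ collapses to bilabial, open, bilabial viseme targets.
    print(phonemes_to_visemes(["m", "ae", "p"]))  # ['V_bilabial', 'V_open', 'V_bilabial']
```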