Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
- URL: http://arxiv.org/abs/2310.05058v3
- Date: Tue, 30 Apr 2024 11:20:47 GMT
- Title: Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
- Authors: Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen
- Abstract summary: A speaker's own characteristics can be captured well by a few of his/her facial images, or even a single image, using shallow networks.
The fine-grained dynamic features associated with the speech content expressed by a talking face require deep sequential networks to represent accurately.
Our approach consistently outperforms existing methods.
- Score: 73.59525356467574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. Firstly, a speaker's own characteristics can always be portrayed well by a few of his/her facial images, or even a single image, with shallow networks, while the fine-grained dynamic features associated with the speech content expressed by the talking face always need deep sequential networks to be represented accurately. Therefore, we treat the shallow and deep layers differently for speaker-adaptive lip reading. Secondly, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip-reading performance for different words and pronunciations, necessitating adaptive enhancement or suppression of features for robust lip reading. Based on these two observations, we propose to take advantage of the speaker's own characteristics to automatically learn separable hidden unit contributions with different targets for the shallow and deep layers, respectively. For shallow layers, where features related to the speaker's characteristics are stronger than the speech-content-related features, we introduce speaker-adaptive features that learn to enhance the speech-content features. For deep layers, where the speaker's features and the speech-content features are both well expressed, we introduce speaker-adaptive features that learn to suppress speech-content-irrelevant noise for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides the evaluation on the popular LRW-ID and GRID datasets, we also release a new dataset for evaluation, CAS-VSR-S68h, to further assess performance in an extreme setting where only a few speakers are available but the speech content covers a large and diversified range.
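To make the layer-wise adaptation concrete, here is a minimal PyTorch sketch of the general idea: a shallow network summarizes the speaker from a single face image, and LHUC-style per-channel contributions conditioned on that embedding rescale shallow features (enhancement) and deep features (suppression). All class names, shapes, and the sigmoid-times-two gating are illustrative assumptions, not the authors' released code.
```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Shallow CNN: a single face image suffices to summarize a speaker."""
    def __init__(self, spk_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, spk_dim),
        )

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        return self.net(face)  # (B, spk_dim)

class HiddenUnitGate(nn.Module):
    """LHUC-style per-channel rescaling conditioned on the speaker.

    One instance on shallow features can be trained to enhance
    speech-content features; another on deep features can be trained
    to suppress speech-content-irrelevant noise.
    """
    def __init__(self, spk_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(spk_dim, channels)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T); gate values lie in (0, 2), like classic LHUC
        gate = 2.0 * torch.sigmoid(self.proj(spk_emb))
        return feats * gate.unsqueeze(-1)

# Toy usage: separate gates for a shallow and a deep feature map.
spk_emb = SpeakerEncoder()(torch.randn(2, 3, 112, 112))
shallow = HiddenUnitGate(128, 64)(torch.randn(2, 64, 25), spk_emb)
deep = HiddenUnitGate(128, 512)(torch.randn(2, 512, 25), spk_emb)
```
Whether a given gate ends up enhancing or suppressing is driven by the training objective; the module itself is symmetric.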
Related papers
- Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language [48.17930606488952]
Lip reading aims to predict spoken language by analyzing lip movements.
Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers.
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels.
arXiv Detail & Related papers (2024-09-02T07:05:12Z)
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z)
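Max-min mutual information objectives of the flavor in the entry above are usually instantiated with tractable proxies. The hedged sketch below uses InfoNCE as the MI lower bound to maximize (same word, different speaker) and a gradient-reversed speaker classifier as a crude stand-in for the MI term to minimize; the helper names and the choice of proxies are assumptions, not the paper's exact estimators.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1):
    """InfoNCE: a tractable lower bound on mutual information (to maximize)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)             # diagonal pairs are positives

def max_min_mi_loss(feats, feats_same_word_other_speaker, speaker_head, speaker_labels):
    # Max term: pull together features of the same word from different speakers.
    mi_max = info_nce(feats, feats_same_word_other_speaker)
    # Min term (proxy): a speaker classifier on gradient-reversed features
    # pushes the representation to be uninformative about speaker identity.
    logits = speaker_head(GradReverse.apply(feats))
    mi_min = F.cross_entropy(logits, speaker_labels)
    return mi_max + mi_min

# usage: loss = max_min_mi_loss(f, f_pos, nn.Linear(256, n_speakers), labels)
```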
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z)
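One plausible reading of "visual-landmark transformer" is a cross-attention fusion of a mouth-crop stream with a lip-landmark stream; the minimal sketch below shows one such fusion layer. The module name, dimensions, and residual design are guesses for illustration, not LipFormer's actual architecture.
```python
import torch
import torch.nn as nn

class VisualLandmarkFusion(nn.Module):
    """One fusion layer: visual tokens attend to lip-landmark tokens."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, landmarks: torch.Tensor) -> torch.Tensor:
        # visual: (B, T, dim) mouth-region features
        # landmarks: (B, T, dim) embedded lip-landmark coordinates
        fused, _ = self.attn(query=visual, key=landmarks, value=landmarks)
        return self.norm(visual + fused)  # residual cross-modal fusion

fused = VisualLandmarkFusion()(torch.randn(2, 25, 256), torch.randn(2, 25, 256))
```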
- Speaker-adaptive Lip Reading with User-dependent Padding [34.85015917909356]
Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation techniques aim to reduce this mismatch between training and test speakers.
arXiv Detail & Related papers (2022-08-09T01:59:30Z)
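The core trick in the entry above, injecting speaker information through the padding of convolutional layers rather than through extra inputs, can be sketched as follows. This assumes a 1-D temporal convolution and a precomputed speaker embedding; the class name and shapes are illustrative.
```python
import torch
import torch.nn as nn

class UserDependentPadConv(nn.Module):
    """1-D temporal convolution whose padding values are predicted from a
    speaker embedding instead of being zeros, so the padding itself
    carries speaker information into every layer that uses it."""
    def __init__(self, spk_dim: int, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.pad = kernel // 2
        self.to_pad = nn.Linear(spk_dim, in_ch)        # one pad vector per speaker
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=0)

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) features; spk_emb: (B, spk_dim)
        pad = self.to_pad(spk_emb).unsqueeze(-1).expand(-1, -1, self.pad)
        x = torch.cat([pad, x, pad], dim=-1)           # speaker-aware padding
        return self.conv(x)

y = UserDependentPadConv(128, 64, 64)(torch.randn(2, 64, 25), torch.randn(2, 128))
```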
- Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation.
arXiv Detail & Related papers (2022-06-04T19:40:02Z)
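Temporal synchronization via deep metric learning, as in the entry above, is commonly realized with a margin loss that pulls time-aligned lip/speech embedding pairs together and pushes temporally shifted pairs apart. A minimal sketch under that assumption (the three-frame shift and the cosine metric are arbitrary illustrative choices):
```python
import torch
import torch.nn.functional as F

def sync_margin_loss(lip_emb: torch.Tensor, speech_emb: torch.Tensor,
                     margin: float = 0.2) -> torch.Tensor:
    """Deep metric learning for audio-visual synchrony: a time-aligned
    lip/speech pair must score higher than a temporally shifted one."""
    # lip_emb, speech_emb: (B, T, D), aligned along the time axis T
    pos = F.cosine_similarity(lip_emb, speech_emb, dim=-1)                 # aligned
    neg = F.cosine_similarity(lip_emb, speech_emb.roll(shifts=3, dims=1), dim=-1)
    return F.relu(margin - pos + neg).mean()

loss = sync_margin_loss(torch.randn(2, 75, 256), torch.randn(2, 75, 256))
```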
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which (1) shares the low-level features common to both factors of variation (speaker identity and linguistic content), and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
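A skeleton of such a two-stream design, a shared trunk plus separate identity and content heads, might look like the following. The layer sizes are made up, and in the paper the disentangling pressure comes from cross-modal self-supervision (audio-visual synchrony) rather than from the architecture alone.
```python
import torch
import torch.nn as nn

class TwoStreamFaceEncoder(nn.Module):
    """Shared low-level trunk with separate identity and content heads,
    giving the network an explicit place to route each factor."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(            # low-level features shared by both factors
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.identity_head = nn.Linear(128, dim)   # slowly varying: who is speaking
        self.content_head = nn.Linear(128, dim)    # quickly varying: what is said

    def forward(self, frame: torch.Tensor):
        h = self.trunk(frame)                      # (B, 128)
        return self.identity_head(h), self.content_head(h)

identity, content = TwoStreamFaceEncoder()(torch.randn(2, 3, 112, 112))
```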
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
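Conditioning an enhancement network on a speaker representation pooled from the test utterance itself, as in the entry above, can be sketched with FiLM-style modulation. Mean pooling, the GRU enhancer, and FiLM are stand-ins chosen for brevity; the paper itself uses multi-head self-attention rather than this exact architecture.
```python
import torch
import torch.nn as nn

class SelfAdaptiveEnhancer(nn.Module):
    """Enhancement network conditioned, FiLM-style, on a speaker
    representation pooled from the noisy test utterance itself."""
    def __init__(self, feat_dim: int = 257, spk_dim: int = 128):
        super().__init__()
        self.spk_net = nn.Sequential(nn.Linear(feat_dim, spk_dim), nn.ReLU())
        self.film = nn.Linear(spk_dim, 2 * feat_dim)   # per-bin scale and shift
        self.enhance = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.mask = nn.Linear(feat_dim, feat_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, T, F) magnitude spectrogram of the utterance to enhance
        spk = self.spk_net(spec).mean(dim=1)           # utterance-level speaker embedding
        scale, shift = self.film(spk).chunk(2, dim=-1)
        x = spec * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h, _ = self.enhance(x)
        return spec * torch.sigmoid(self.mask(h))      # mask-based enhancement

enhanced = SelfAdaptiveEnhancer()(torch.randn(2, 100, 257))
```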