Speaker-adaptive Lip Reading with User-dependent Padding
- URL: http://arxiv.org/abs/2208.04498v1
- Date: Tue, 9 Aug 2022 01:59:30 GMT
- Title: Speaker-adaptive Lip Reading with User-dependent Padding
- Authors: Minsu Kim, Hyunjun Kim, Yong Man Ro
- Abstract summary: Lip reading aims to predict speech based on lip movements alone.
Because it relies on visual information to model speech, its performance is inherently sensitive to personal lip appearance and movement.
Speaker adaptation techniques aim to reduce the resulting mismatch between training and test speakers.
- Score: 34.85015917909356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading aims to predict speech based on lip movements alone. As it focuses on visual information to model speech, its performance is inherently sensitive to personal lip appearance and movement. As a result, lip reading models show degraded performance when applied to unseen speakers due to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between training and test speakers, guiding a trained model to focus on modeling the speech content without interference from speaker variations. In contrast to the decades of effort devoted to audio-based speech recognition, speaker adaptation methods have not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that can participate in the visual feature extraction stage of a pre-trained lip reading model. Therefore, the lip appearance and movement information of different speakers can be considered during visual feature encoding, adaptively for individual speakers. Moreover, the proposed method requires 1) no additional layers, 2) no modification of the learned weights of the pre-trained model, and 3) no speaker labels for the training data used during pre-training. It can directly adapt to unseen speakers by learning only the user-dependent padding, in a supervised or unsupervised manner. Finally, to alleviate the lack of speaker information in public lip reading databases, we label the speakers of the well-known audio-visual database LRW and design an unseen-speaker lip reading scenario named LRW-ID.
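
The mechanism described in the abstract, replacing the fixed padding of a frozen visual front-end with learnable, speaker-specific values and optimizing only those values at adaptation time, can be illustrated with a short sketch. The PyTorch-style code below is a minimal illustration under simplifying assumptions, not the authors' implementation: it assumes a 2D convolutional front-end, represents the user-dependent padding as one learnable value per input channel of each padded convolution, and the names `UserDependentPadConv2d`, `wrap_padded_convs`, and `adapt_to_speaker` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UserDependentPadConv2d(nn.Module):
    """Wraps a pre-trained Conv2d so its implicit zero padding is replaced by
    learnable, speaker-specific padding values (one value per input channel
    here, for simplicity)."""

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        self.pad_h, self.pad_w = conv.padding
        # Speaker-specific parameters: the only thing learned during adaptation.
        self.user_padding = nn.Parameter(torch.zeros(conv.in_channels))
        # Disable the wrapped layer's own zero padding; we pad explicitly below.
        self.conv.padding = (0, 0)

    def forward(self, x):
        # Pad the input with zeros first ...
        padded = F.pad(x, (self.pad_w, self.pad_w, self.pad_h, self.pad_h))
        # ... then write the learnable speaker-specific value onto the border.
        mask = torch.ones_like(padded)
        h, w = padded.shape[-2:]
        mask[..., self.pad_h:h - self.pad_h, self.pad_w:w - self.pad_w] = 0.0
        padded = padded + mask * self.user_padding.view(1, -1, 1, 1)
        return self.conv(padded)


def wrap_padded_convs(module: nn.Module) -> nn.Module:
    """Recursively replace every padded Conv2d in the frozen visual front-end
    with the user-dependent padding wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.padding != (0, 0):
            setattr(module, name, UserDependentPadConv2d(child))
        else:
            wrap_padded_convs(child)
    return module


def adapt_to_speaker(model, loader, criterion, steps=300, lr=1e-2):
    """Adapt the pre-trained model to one target speaker: all learned weights
    stay frozen, only the padding parameters are optimized (supervised here;
    an unsupervised variant could instead optimize against pseudo-labels)."""
    for p in model.parameters():
        p.requires_grad_(False)
    pad_params = [m.user_padding for m in model.modules()
                  if isinstance(m, UserDependentPadConv2d)]
    for p in pad_params:
        p.requires_grad_(True)
    optimizer = torch.optim.Adam(pad_params, lr=lr)

    model.eval()  # keep BatchNorm statistics of the pre-trained model fixed
    step = 0
    while step < steps:
        for video, label in loader:
            optimizer.zero_grad()
            loss = criterion(model(video), label)
            loss.backward()
            optimizer.step()
            step += 1
            if step >= steps:
                break
    return model
```

In this sketch, one `user_padding` tensor per wrapped layer would be kept for each speaker; switching to another speaker amounts to loading that speaker's padding values while the shared pre-trained weights remain untouched, which matches the abstract's claim of adaptation without extra layers or weight updates.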
Related papers
- Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language [48.17930606488952]
Lip reading aims to predict spoken language by analyzing lip movements.
Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers.
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels.
arXiv Detail & Related papers (2024-09-02T07:05:12Z)
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well from just a few of their facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with the speech content expressed by the talking face, in contrast, require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z)
- Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture dataset to evaluate single speaker natural speech generation tasks.
arXiv Detail & Related papers (2022-06-04T19:40:02Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.