Speaker-adaptive Lip Reading with User-dependent Padding
- URL: http://arxiv.org/abs/2208.04498v1
- Date: Tue, 9 Aug 2022 01:59:30 GMT
- Title: Speaker-adaptive Lip Reading with User-dependent Padding
- Authors: Minsu Kim, Hyunjun Kim, Yong Man Ro
- Abstract summary: Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation techniques aim to reduce this mismatch between train and test speakers.
- Score: 34.85015917909356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading aims to predict speech based on lip movements alone. As it
focuses on visual information to model the speech, its performance is
inherently sensitive to personal lip appearances and movements. This makes the
lip reading models show degraded performance when they are applied to unseen
speakers due to the mismatch between training and testing conditions. Speaker
adaptation techniques aim to reduce this mismatch between train and test
speakers, guiding a trained model to focus on modeling the speech content
without being affected by speaker variations. In contrast to the decades of
effort devoted to audio-based speech recognition, speaker adaptation methods
have not been well studied in lip reading. In this paper, to remedy the
performance degradation of lip reading models on unseen speakers, we propose a
speaker-adaptive lip reading method, namely user-dependent padding. The
user-dependent padding is a speaker-specific input that can participate in the
visual feature extraction stage of a pre-trained lip reading model. Therefore,
the lip appearances and movements information of different speakers can be
considered during the visual feature encoding, adaptively for individual
speakers. Moreover, the proposed method requires 1) no additional layers, 2) no
modification of the learned weights of the pre-trained model, and 3) no speaker
labels for the training data used during pre-training. It can directly adapt to
unseen speakers by learning only the user-dependent padding, in either a
supervised or an unsupervised manner. Finally, to alleviate the lack of speaker
information in public lip reading databases, we label the speakers of a
well-known audio-visual database, LRW, and design an unseen-speaker lip reading
scenario named LRW-ID.
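The core idea can be illustrated with a minimal sketch. Assuming the method replaces the fixed zero padding of a pre-trained convolution with a learnable, speaker-specific padding value while the pre-trained weights stay frozen (the class and parameter names below are hypothetical, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserDependentPaddingConv(nn.Module):
    """Conv layer whose border padding is a learnable, per-speaker
    parameter instead of fixed zeros (illustrative sketch only)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size // 2
        # Pre-trained conv; its weights are frozen during adaptation.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)
        # Hypothetical user-dependent padding: one learnable value per
        # input channel, broadcast over the padded border region.
        self.udp = nn.Parameter(torch.zeros(1, in_ch, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.pad
        # Zero-pad first, then fill only the border with the learned value.
        x = F.pad(x, (p, p, p, p), value=0.0)
        border = torch.ones_like(x)
        border[..., p:-p, p:-p] = 0.0
        x = x + border * self.udp
        return self.conv(x)
```

During per-speaker adaptation, only `udp` would be optimized: freeze `layer.conv.parameters()` and pass `[layer.udp]` to the optimizer, matching the abstract's claim that no pre-trained weights are modified and no layers are added.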
Related papers
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z) - Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be portrayed well by a few facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with speech content expressed by the talking face always need deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z) - LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel on overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z) - Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movement of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture dataset to evaluate single speaker natural speech generation tasks.
arXiv Detail & Related papers (2022-06-04T19:40:02Z) - LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.