Speaker-adaptive Lip Reading with User-dependent Padding
- URL: http://arxiv.org/abs/2208.04498v1
- Date: Tue, 9 Aug 2022 01:59:30 GMT
- Title: Speaker-adaptive Lip Reading with User-dependent Padding
- Authors: Minsu Kim, Hyunjun Kim, Yong Man Ro
- Abstract summary: Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation techniques aim to reduce this mismatch between train and test speakers.
- Score: 34.85015917909356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading aims to predict speech based on lip movements alone. As it
focuses on visual information to model the speech, its performance is
inherently sensitive to personal lip appearances and movements. This makes the
lip reading models show degraded performance when they are applied to unseen
speakers due to the mismatch between training and testing conditions. Speaker
adaptation techniques aim to reduce this mismatch between train and test
speakers, guiding a trained model to focus on modeling the speech content
without being affected by speaker variations. In contrast to the decades of
effort devoted to audio-based speech recognition, speaker adaptation methods
have not been well studied in lip reading. In this paper, to remedy the
performance degradation of lip reading models on unseen speakers, we propose a
speaker-adaptive lip reading method, namely user-dependent padding. The
user-dependent padding is a speaker-specific input that can participate in the
visual feature extraction stage of a pre-trained lip reading model. Therefore,
the lip appearances and movements information of different speakers can be
considered during the visual feature encoding, adaptively for individual
speakers. Moreover, the proposed method requires 1) no additional layers, 2) no
modification of the learned weights of the pre-trained model, and 3) no speaker
labels for the training data used during pre-training. It can directly adapt to
unseen speakers by learning only the user-dependent padding, in either a
supervised or an unsupervised manner. Finally, to alleviate the lack of speaker
information in public lip reading databases, we label the speakers of a
well-known audio-visual database, LRW, and design an unseen-speaker lip reading
scenario named LRW-ID.
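The core idea can be illustrated with a minimal sketch. Assuming the method replaces the fixed zero padding of a pre-trained convolution with a learnable, speaker-specific padding value while the pre-trained weights stay frozen (the class and parameter names below are hypothetical, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserDependentPaddingConv(nn.Module):
    """Conv layer whose border padding is a learnable, per-speaker
    parameter instead of fixed zeros (illustrative sketch only)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size // 2
        # Pre-trained conv; its weights are frozen during adaptation.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)
        # Hypothetical user-dependent padding: one learnable value per
        # input channel, broadcast over the padded border region.
        self.udp = nn.Parameter(torch.zeros(1, in_ch, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.pad
        # Zero-pad first, then fill only the border with the learned value.
        x = F.pad(x, (p, p, p, p), value=0.0)
        border = torch.ones_like(x)
        border[..., p:-p, p:-p] = 0.0
        x = x + border * self.udp
        return self.conv(x)
```

During per-speaker adaptation, only `udp` would be optimized: freeze `layer.conv.parameters()` and pass `[layer.udp]` to the optimizer, matching the abstract's claim that no pre-trained weights are modified and no layers are added.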
Related papers
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z) - Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be portrayed well by a few facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with speech content expressed by the talking face always need deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z) - LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel on overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z) - Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movement of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture dataset to evaluate single speaker natural speech generation tasks.
arXiv Detail & Related papers (2022-06-04T19:40:02Z) - LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.