Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
- URL: http://arxiv.org/abs/2409.00986v1
- Date: Mon, 2 Sep 2024 07:05:12 GMT
- Title: Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
- Authors: Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
- Abstract summary: Lip reading aims to predict spoken language by analyzing lip movements.
Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers.
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels.
- Score: 48.17930606488952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker-adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. However, the effectiveness of adapting the target speaker's language information, such as vocabulary choice, has not been explored in previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, limiting the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both the vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt it to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in sentence-level lip reading in the wild for the first time. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, we show that the proposed adaptation method achieves larger improvements on the target speaker compared to previous works.
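The abstract describes freezing a pre-trained lip reading model and adapting it to a target speaker with prompt tuning at the vision level and LoRA at the language level. A minimal PyTorch sketch of that recipe is given below; the module names, feature shapes, and the exact placement of the adapters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pre-trained weights stay frozen
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)         # update starts at zero, so behaviour is initially unchanged
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))


class SpeakerAdapter(nn.Module):
    """Hypothetical wrapper: learnable prompt tokens adapt the vision level,
    LoRA on a decoder projection adapts the language level; only these small
    modules are trained on the target speaker's adaptation data."""
    def __init__(self, visual_encoder: nn.Module, decoder_proj: nn.Linear,
                 num_prompts: int = 16, feat_dim: int = 512):
        super().__init__()
        self.visual_encoder = visual_encoder        # assumed frozen pre-trained front-end
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, feat_dim))
        self.decoder_proj = LoRALinear(decoder_proj)

    def forward(self, lip_frames: torch.Tensor) -> torch.Tensor:
        feats = self.visual_encoder(lip_frames)     # (B, T, feat_dim), assumed layout
        prompts = self.prompts.expand(feats.size(0), -1, -1)
        feats = torch.cat([prompts, feats], dim=1)  # prepend speaker-specific prompts
        return self.decoder_proj(feats)             # adapted features go on to the decoder
```

Under these assumptions, only the prompt tokens and the two low-rank matrices receive gradients on the target speaker's adaptation data, so each new speaker adds only a small set of parameters on top of the shared pre-trained model.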
Related papers
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can always be portrayed well by a few of his/her facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with speech content expressed by the talking face always need deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Lip-to-Speech Synthesis in the Wild with Multi-task Learning [32.65865343643458]
We develop a powerful Lip2Speech method that can reconstruct speech with correct content from the input lip movements, even in the wild.
We design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss.
arXiv Detail & Related papers (2023-02-17T12:31:26Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel in interpreting overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z)
- Speaker-adaptive Lip Reading with User-dependent Padding [34.85015917909356]
Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation techniques aim to reduce this mismatch between training and test speakers.
arXiv Detail & Related papers (2022-08-09T01:59:30Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
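The CLIP-Adapter entry above fine-tunes lightweight feature adapters on top of a frozen backbone. A minimal PyTorch sketch of that idea follows; the bottleneck size and blending ratio are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    """Small residual bottleneck MLP applied to frozen visual or text features."""
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio                          # how much adapted signal to blend back in
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, frozen_features: torch.Tensor) -> torch.Tensor:
        adapted = self.net(frozen_features)
        # residual blend keeps most of the original (frozen) representation
        return self.ratio * adapted + (1.0 - self.ratio) * frozen_features
```

Because the backbone stays frozen, only the two small linear layers are trained, which is what makes this kind of adapter an alternative to prompt tuning.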
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
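The LiRA entry above trains a visual model to predict acoustic features from unlabelled visual speech. The sketch below illustrates that kind of self-supervised objective; the feature dimensions, the regression loss, and the target representation are assumptions for illustration, not LiRA's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualToAcoustic(nn.Module):
    """Self-supervised objective: regress per-frame acoustic features from lip video."""
    def __init__(self, visual_encoder: nn.Module, feat_dim: int = 256, acoustic_dim: int = 80):
        super().__init__()
        self.visual_encoder = visual_encoder        # e.g. a ResNet + Conformer stack
        self.head = nn.Linear(feat_dim, acoustic_dim)   # predicts acoustic targets per frame

    def forward(self, lip_frames: torch.Tensor, acoustic_targets: torch.Tensor) -> torch.Tensor:
        feats = self.visual_encoder(lip_frames)     # (B, T, feat_dim), assumed layout
        pred = self.head(feats)                     # (B, T, acoustic_dim)
        # regression loss against acoustic features extracted from the audio track
        return F.l1_loss(pred, acoustic_targets)
```

After this pre-training stage, the visual encoder can be fine-tuned for lip reading, which is how such self-supervised representations are typically used downstream.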
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.