Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
- URL: http://arxiv.org/abs/2409.00986v2
- Date: Wed, 01 Jan 2025 06:10:16 GMT
- Title: Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
- Authors: Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
- Abstract summary: Lip reading aims to predict spoken language by analyzing lip movements.
Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers.
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels.
- Abstract: Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker-adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. However, the effectiveness of adapting to the language information of the target speaker, such as vocabulary choice, has not been explored in previous works. Additionally, existing datasets for speaker adaptation have limited vocabulary sizes and pose variations, which restrict the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both the vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. Furthermore, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables, for the first time in English, the validation of adaptation methods on sentence-level lip reading in the wild. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, we show that the proposed method achieves larger improvements than previous works.
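The abstract describes adapting a frozen pre-trained lip reading model with learnable prompts (vision level) and LoRA updates (language level). A minimal PyTorch sketch of that general recipe follows; the module names, dimensions, and where the prompts/LoRA attach are illustrative assumptions, not the authors' actual architecture:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class SpeakerAdaptedEncoder(nn.Module):
    """Toy encoder: speaker prompts prepended to visual features,
    plus a LoRA-adapted projection; only the adapters train."""

    def __init__(self, dim: int = 64, n_prompts: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(n_prompts, dim))
        self.proj = LoRALinear(nn.Linear(dim, dim))

    def forward(self, feats):  # feats: (batch, time, dim)
        b = feats.size(0)
        p = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.proj(torch.cat([p, feats], dim=1))


encoder = SpeakerAdaptedEncoder()
out = encoder(torch.randn(2, 10, 64))  # 4 prompt tokens are prepended
```

Per-speaker adaptation then optimizes only `prompts`, `down`, and `up`, so each target speaker costs a few thousand parameters instead of a full fine-tune.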
Related papers
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z) - Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well from a few facial images, or even a single image, with shallow networks.
Fine-grained dynamic features tied to the speech content of a talking face, by contrast, require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z) - LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting speakers seen during training (overlapped speakers).
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z) - Speaker-adaptive Lip Reading with User-dependent Padding [34.85015917909356]
Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation techniques aim to reduce this mismatch between training and test speakers.
arXiv Detail & Related papers (2022-08-09T01:59:30Z) - Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movement of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation.
arXiv Detail & Related papers (2022-06-04T19:40:02Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z) - LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.