LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark
Transformers
- URL: http://arxiv.org/abs/2302.02141v1
- Date: Sat, 4 Feb 2023 10:22:18 GMT
- Title: LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark
Transformers
- Authors: Feng Xue, Yu Li, Deyin Liu, Yincen Xie, Lin Wu, Richang Hong
- Abstract summary: State-of-the-art lipreading methods excel at interpreting overlap speakers (speakers seen in both training and inference).
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
- Score: 43.13868262922689
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Lipreading refers to understanding and further translating the speech of a
speaker in a video into natural language. State-of-the-art lipreading methods
excel at interpreting overlap speakers, i.e., speakers that appear in both the
training and inference sets. However, generalizing these methods to unseen
speakers incurs catastrophic performance degradation due to the limited number
of speakers in the training bank and the evident visual variations caused by the
shape/color of lips across different speakers. Therefore, merely depending on the
visible changes of the lips tends to cause model overfitting. To address this
problem, we propose to use multi-modal features across visuals and landmarks,
which can describe the lip motion irrespective of speaker identities. Then,
we develop a sentence-level lipreading framework based on visual-landmark
transformers, namely LipFormer. Specifically, LipFormer consists of a lip
motion stream, a facial landmark stream, and a cross-modal fusion. The
embeddings from the two streams are produced by self-attention and then fed
to a cross-attention module to achieve alignment between visuals and
landmarks. Finally, the resulting fused features are decoded into output text
by a cascaded seq2seq model. Experiments demonstrate that our method
effectively enhances model generalization to unseen speakers.
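
The abstract outlines a two-stream architecture: self-attention within each modality, cross-attention to align visual embeddings with landmark embeddings, and a seq2seq decoder over the fused features. The following is a minimal PyTorch sketch of that flow; the module names, dimensions, and the Transformer-based decoder are illustrative assumptions (the paper describes a cascaded seq2seq model), not the authors' released implementation.

```python
# Minimal sketch of the two-stream + cross-attention fusion described in the
# abstract. All names, dimensions, and the decoder choice are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn


class LipFormerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, vocab_size=40):
        super().__init__()
        # Per-stream self-attention encoders; lip-motion and facial-landmark
        # features are assumed to be pre-extracted per frame with dim d_model.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.landmark_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Cross-attention: visual embeddings attend to landmark embeddings
        # to align the two modalities.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Decoder that maps the fused features to text tokens.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, landmark_feats, text_tokens):
        # visual_feats, landmark_feats: (batch, frames, d_model)
        v = self.visual_encoder(visual_feats)        # self-attention, visual stream
        lm = self.landmark_encoder(landmark_feats)   # self-attention, landmark stream
        fused, _ = self.cross_attn(query=v, key=lm, value=lm)  # cross-modal alignment
        tgt = self.token_embed(text_tokens)          # (batch, text_len, d_model)
        dec = self.decoder(tgt, memory=fused)        # decode fused features to text
        return self.out_proj(dec)                    # per-token logits


# Example forward pass with random features (shapes only).
model = LipFormerSketch()
logits = model(torch.randn(2, 75, 256), torch.randn(2, 75, 256),
               torch.randint(0, 40, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 40])
```

In the paper's formulation the visual stream comes from lip-region frames and the landmark stream from facial keypoints; here both are assumed to already be frame-level feature vectors of equal dimension.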
Related papers
- Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language [48.17930606488952]
Lip reading aims to predict spoken language by analyzing lip movements.
Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers.
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels.
arXiv Detail & Related papers (2024-09-02T07:05:12Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization [4.801824063852808]
We propose to exploit lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features.
A max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations.
arXiv Detail & Related papers (2024-03-24T09:18:21Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can always be portrayed well by his/her few facial images or even a single image with shallow networks.
Fine-grained dynamic features associated with speech content expressed by the talking face always need deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation is a technique whose applications extend to various multimedia fields.
Previous research has generated promising realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, which involves self-supervision in a cross-modal network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z)
- Speaker-adaptive Lip Reading with User-dependent Padding [34.85015917909356]
Lip reading aims to predict speech based on lip movements alone.
As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements.
Speaker adaptation technique aims to reduce this mismatch between train and test speakers.
arXiv Detail & Related papers (2022-08-09T01:59:30Z)
- Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture dataset to evaluate single speaker natural speech generation tasks.
arXiv Detail & Related papers (2022-06-04T19:40:02Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Deformation Flow Based Two-Stream Network for Lip Reading [90.61063126619182]
Lip reading is the task of recognizing the speech content by analyzing movements in the lip region when people are speaking.
We observe the continuity in adjacent frames in the speaking process, and the consistency of the motion patterns among different speakers when they pronounce the same phoneme.
We introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region.
The learned deformation flow is then combined with the original grayscale frames with a two-stream network to perform lip reading.
arXiv Detail & Related papers (2020-03-12T11:13:44Z)