Sub-word Level Lip Reading With Visual Attention
- URL: http://arxiv.org/abs/2110.07603v1
- Date: Thu, 14 Oct 2021 17:59:57 GMT
- Title: Sub-word Level Lip Reading With Visual Attention
- Authors: Prajwal K R, Triantafyllos Afouras, Andrew Zisserman
- Abstract summary: We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
- Score: 88.89348882036512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is to learn strong lip reading models that can
recognise speech in silent videos. Most prior works deal with the open-set
visual speech recognition problem by adapting existing automatic speech
recognition techniques on top of trivially pooled visual features. Instead, in
this paper we focus on the unique challenges encountered in lip reading and
propose tailored solutions. To that end we make the following contributions:
(1) we propose an attention-based pooling mechanism to aggregate visual speech
representations; (2) we use sub-word units for lip reading for the first time
and show that this allows us to better model the ambiguities of the task; (3)
we propose a training pipeline that balances the lip reading performance with
other key factors such as data and compute efficiency. Following the above, we
obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks
when training on public datasets, and even surpass models trained on
large-scale industrial datasets by using an order of magnitude less data. Our
best model achieves 22.6% word error rate on the LRS2 dataset, a performance
unprecedented for lip reading models, significantly reducing the performance
gap between lip reading and automatic speech recognition.
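For intuition, here is a minimal PyTorch-style sketch of what such attention-based pooling could look like; it is a hypothetical illustration, not the paper's actual module, in which a learnable query replaces naive average pooling over each frame's spatial features (all dimensions are assumed).
```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Hypothetical attention-based pooling: a learnable query attends over
    the spatial feature map of each frame instead of averaging it."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))     # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch * frames, H * W, dim) spatial features per frame
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)                # weighted spatial aggregation
        return pooled.squeeze(1)                              # (batch * frames, dim)

# Example: pool a 7x7 grid of 512-d features for four frames
pool = AttentionPool(dim=512)
print(pool(torch.randn(4, 49, 512)).shape)                    # torch.Size([4, 512])
```
The sub-word targets mentioned in the abstract would then come from a standard sub-word tokenizer (e.g. BPE or wordpieces) instead of character or whole-word labels.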
Related papers
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a 26% word error rate (WER).
We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
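A hedged sketch of the latent-to-latent idea (assumed dimensions and a deliberately simple mapper, not the authors' architecture): visual speech latents are regressed onto the latent space of a pretrained audio model, whose frozen ASR decoder can then transcribe them.
```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Toy mapper from visual latents to audio-like latents."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, visual_latents: torch.Tensor) -> torch.Tensor:
        return self.net(visual_latents)           # predicted audio-like latents

mapper = LatentMapper()
visual_latents = torch.randn(2, 100, 768)         # (batch, time, dim) from a video encoder
audio_latents = torch.randn(2, 100, 768)          # targets from a frozen audio encoder
loss = nn.functional.mse_loss(mapper(visual_latents), audio_latents)
loss.backward()                                   # only the mapper is trained
```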
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
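For illustration only (the grouping below is a common textbook-style example, not the paper's exact mapping), visemes collapse visually indistinguishable phonemes onto shared lip-shape classes before they are resolved into words.
```python
# Illustrative phoneme-to-viseme grouping: phonemes that look the same on the
# lips map to one viseme class.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar",
}

def to_visemes(phonemes):
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

print(to_visemes(["b", "a", "t"]))   # ['bilabial', 'other', 'alveolar']
```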
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
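A minimal sketch of how such an expert could be plugged in, assuming a frozen lip-reading network with a per-token output head (the interfaces and shapes here are hypothetical, not the paper's code).
```python
import torch
import torch.nn.functional as F

def lip_expert_loss(generated_frames, text_tokens, lip_reader):
    """Penalise the generator when a frozen lip-reading expert cannot
    recover the target text from the generated mouth region."""
    logits = lip_reader(generated_frames)                 # (B, L, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           text_tokens.reshape(-1))

# Stand-in expert so the sketch runs; a real setup would load a pretrained model.
class DummyLipReader(torch.nn.Module):
    def forward(self, frames):                            # frames: (B, T, C, H, W)
        return torch.randn(frames.size(0), 20, 500, requires_grad=True)

loss = lip_expert_loss(torch.randn(2, 20, 3, 96, 96),
                       torch.randint(0, 500, (2, 20)),
                       DummyLipReader())
print(loss.item())   # added to reconstruction and sync losses in the full objective
```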
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC) architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows speech to be recognized more accurately in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
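The sketch below shows the general shape of an audio-visual CTC objective with naive feature concatenation; it is an assumption-laden toy, not the Efficient Conformer architecture itself.
```python
import torch
import torch.nn as nn

vocab, T, B, dim = 40, 50, 2, 256
audio_feats = torch.randn(T, B, dim)                     # encoder outputs, (time, batch, dim)
visual_feats = torch.randn(T, B, dim)

fuse = nn.Linear(2 * dim, vocab)                         # naive concatenation fusion
log_probs = fuse(torch.cat([audio_feats, visual_feats], dim=-1)).log_softmax(-1)

targets = torch.randint(1, vocab, (B, 12))               # sub-word/character ids, 0 = blank
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((B,), T), torch.full((B,), 12))
loss.backward()
```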
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities' pretext tasks, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
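A simplified sketch in the spirit of cross-modal self-supervision (momentum targets, negative-cosine objective); the encoders, masking, and exact losses here are assumptions rather than RAVEn's actual design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
video_encoder = nn.GRU(dim, dim, batch_first=True)
audio_target_encoder = nn.GRU(dim, dim, batch_first=True)   # momentum copy, no gradients
predictor = nn.Linear(dim, dim)

video = torch.randn(2, 100, dim)        # stand-ins for encoded raw-video features
audio = torch.randn(2, 100, dim)        # stand-ins for encoded raw-audio features

with torch.no_grad():
    targets, _ = audio_target_encoder(audio)                # cross-modal targets, no labels
pred = predictor(video_encoder(video)[0])
loss = 1 - F.cosine_similarity(pred, targets, dim=-1).mean()
loss.backward()
```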
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading [0.06157382820537718]
We propose a new lip-reading model that combines three contributions.
On the LRW lip-reading benchmark, a noticeable accuracy improvement is demonstrated.
arXiv Detail & Related papers (2021-08-07T23:46:25Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
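A hedged sketch of this kind of pretext task, using MFCCs as a stand-in for the acoustic targets (LiRA's actual targets and front-end differ): the visual branch regresses frame-level acoustic features computed from the untranscribed audio track.
```python
import torch
import torch.nn as nn
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)
waveform = torch.randn(1, 16000)                      # 1 s of (dummy) audio
targets = mfcc(waveform).transpose(1, 2)              # (1, frames, 13) acoustic targets

visual_feats = torch.randn(1, targets.size(1), 512)   # stand-in visual encoder output
head = nn.Linear(512, 13)                             # regression head on the visual branch
loss = nn.functional.l1_loss(head(visual_feats), targets)
loss.backward()
```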
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
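As a minimal sketch of the hybrid CTC/Attention objective (the 0.3 weighting and all shapes are illustrative assumptions): the shared encoder output is scored by a CTC head and an autoregressive attention decoder, and the two losses are interpolated.
```python
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_log_probs, in_lens, dec_logits, targets, tgt_lens, alpha=0.3):
    ctc = F.ctc_loss(ctc_log_probs, targets, in_lens, tgt_lens, blank=0)
    att = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)), targets.reshape(-1))
    return alpha * ctc + (1 - alpha) * att               # weighted combination

T, B, L, vocab = 60, 2, 15, 40
loss = hybrid_loss(torch.randn(T, B, vocab).log_softmax(-1), torch.full((B,), T),
                   torch.randn(B, L, vocab), torch.randint(1, vocab, (B, L)),
                   torch.full((B,), L))
print(loss)
```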
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.