Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip
Reading
- URL: http://arxiv.org/abs/2108.03543v1
- Date: Sat, 7 Aug 2021 23:46:25 GMT
- Title: Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip
Reading
- Authors: Shahd Elashmawy, Marian Ramsis, Hesham M. Eraqi, Farah Eldeshnawy,
Hadeel Mabrouk, Omar Abugabal, Nourhan Sakr
- Abstract summary: We propose a new lip-reading model that combines three contributions.
On the LRW lip-reading benchmark, a noticeable accuracy improvement is demonstrated.
- Score: 0.06157382820537718
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the advancement in the domain of audio and audio-visual speech
recognition, visual speech recognition systems are still quite under-explored
due to the visual ambiguity of some phonemes. In this work, we propose a new
lip-reading model that combines three contributions. First, the model
front-end adopts a spatio-temporal attention mechanism to help extract
informative features from the input visual frames. Second, the model back-end
utilizes sequence-level and frame-level Knowledge Distillation (KD) techniques
that allow leveraging audio data during visual model training. Third, a data
preprocessing pipeline is adopted that includes lip alignment based on facial
landmark detection. On the LRW lip-reading benchmark, a noticeable accuracy
improvement is demonstrated; the spatio-temporal attention, Knowledge
Distillation, and lip-alignment contributions achieve 88.43%, 88.64%, and
88.37% accuracy, respectively.
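As a concrete illustration of the back-end training objective, the sketch below combines a standard cross-entropy term with sequence-level and frame-level KD terms that transfer soft targets from a pretrained audio teacher to the visual student. This is a minimal PyTorch sketch under assumed choices (KL-divergence distillation, a shared temperature, and illustrative loss weights); the paper's exact formulation and teacher outputs may differ.

```python
import torch.nn.functional as F

def lip_reading_kd_loss(student_word_logits,    # (B, C): visual model word-level logits
                        student_frame_logits,   # (B, T, C): visual model per-frame posteriors
                        teacher_word_logits,    # (B, C): pretrained audio teacher word-level logits
                        teacher_frame_logits,   # (B, T, C): audio teacher per-frame posteriors
                        labels,                 # (B,): ground-truth word indices (LRW is 500-way)
                        temperature=2.0, alpha=0.5, beta=0.5):
    """Hypothetical combination of hard-label cross-entropy with
    sequence-level and frame-level distillation terms."""
    # Supervised term on the ground-truth word label.
    ce = F.cross_entropy(student_word_logits, labels)

    # Sequence-level KD: match the teacher's softened word-level distribution.
    seq_kd = F.kl_div(
        F.log_softmax(student_word_logits / temperature, dim=-1),
        F.softmax(teacher_word_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Frame-level KD: match the teacher's per-frame distributions,
    # averaged over the T frames of the clip.
    frame_kd = F.kl_div(
        F.log_softmax(student_frame_logits / temperature, dim=-1),
        F.softmax(teacher_frame_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2 / student_frame_logits.size(1)

    return ce + alpha * seq_kd + beta * frame_kd
```

The weights (alpha, beta) and the temperature are hyperparameters; in practice they would be tuned on a validation split rather than fixed as above.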
Related papers
- Lip2Vec: Efficient and Robust Visual Speech Recognition via
Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a WER of 26.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Lip-Listening: Mixing Senses to Understand Lips using Cross Modality
Knowledge Distillation for Word-Based Models [0.03499870393443267]
This work builds on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems.
We propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training.
arXiv Detail & Related papers (2022-06-05T15:47:54Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided
Adaptive Memory [61.44510300515693]
We study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip reading transducer with attention-guided adaptive memory.
The experiments show that SimulLR achieves a translation speedup of 9.10 times compared with state-of-the-art non-simultaneous methods.
arXiv Detail & Related papers (2021-08-31T05:54:16Z) - Supervised Contrastive Learning for Accented Speech Recognition [7.5253263976291676]
We study the supervised contrastive learning framework for accented speech recognition.
We show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average.
arXiv Detail & Related papers (2021-07-02T09:23:33Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech (a minimal sketch of this idea follows the list below).
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
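The sketch referenced in the LiRA entry above illustrates the general idea of regressing acoustic features from unlabelled visual speech. The encoder here is a toy GRU stand-in rather than the ResNet+Conformer used in that paper, and the L1 regression objective and feature dimensions are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class VisualToAudioRegressor(nn.Module):
    """Toy stand-in for a visual front-end that maps a sequence of lip-region
    feature vectors to per-frame acoustic feature predictions."""
    def __init__(self, visual_dim=512, acoustic_dim=256, hidden_dim=512):
        super().__init__()
        self.encoder = nn.GRU(visual_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, acoustic_dim)

    def forward(self, visual_feats):              # (B, T, visual_dim)
        hidden, _ = self.encoder(visual_feats)
        return self.head(hidden)                  # (B, T, acoustic_dim)

def pretraining_step(model, visual_feats, acoustic_targets, optimizer):
    """One self-supervised step: regress acoustic features from unlabelled
    video; an L1 objective is assumed here for illustration."""
    optimizer.zero_grad()
    pred = model(visual_feats)
    loss = nn.functional.l1_loss(pred, acoustic_targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random tensors standing in for real features.
model = VisualToAudioRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
video = torch.randn(4, 29, 512)        # e.g. 29 frames of pooled lip-ROI features
audio_feats = torch.randn(4, 29, 256)  # acoustic features aligned to the frames
loss = pretraining_step(model, video, audio_feats, optimizer)
```

After such pretraining, the visual encoder would typically be fine-tuned on a labelled lip-reading dataset such as LRW.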