Audio-Visual Efficient Conformer for Robust Speech Recognition
- URL: http://arxiv.org/abs/2301.01456v1
- Date: Wed, 4 Jan 2023 05:36:56 GMT
- Title: Audio-Visual Efficient Conformer for Robust Speech Recognition
- Authors: Maxime Burchi and Radu Timofte
- Abstract summary: We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC) architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities enables better speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
- Score: 91.3755431537592
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: End-to-end Automatic Speech Recognition (ASR) systems based on neural
networks have seen large improvements in recent years. The availability of
large scale hand-labeled datasets and sufficient computing resources made it
possible to train powerful deep neural networks, reaching very low Word Error
Rate (WER) on academic benchmarks. However, despite impressive performance on
clean audio samples, a drop in performance is often observed on noisy speech.
In this work, we propose to improve the noise robustness of the recently
proposed Efficient Conformer Connectionist Temporal Classification (CTC)-based
architecture by processing both audio and visual modalities. We improve
previous lip reading methods using an Efficient Conformer back-end on top of a
ResNet-18 visual front-end and by adding intermediate CTC losses between
blocks. We condition intermediate block features on early predictions using
Inter CTC residual modules to relax the conditional independence assumption of
CTC-based models. We also replace the Efficient Conformer grouped attention by
a more efficient and simpler attention mechanism that we call patch attention.
We experiment with publicly available Lip Reading Sentences 2 (LRS2) and Lip
Reading Sentences 3 (LRS3) datasets. Our experiments show that using audio and
visual modalities enables better speech recognition in the presence of
environmental noise and significantly accelerates training, reaching a lower WER
with 4 times fewer training steps. Our Audio-Visual Efficient Conformer (AVEC)
model achieves state-of-the-art performance, reaching WER of 2.3% and 1.8% on
LRS2 and LRS3 test sets. Code and pretrained models are available at
https://github.com/burchim/AVEC.
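To illustrate the Inter CTC residual idea described above: an intermediate encoder block's features are projected to vocabulary posteriors, used for an auxiliary CTC loss, and fed back into the features so that later blocks are conditioned on the early prediction. Below is a minimal PyTorch sketch of that mechanism; the dimensions, layer names, and loss weighting are illustrative assumptions, not the exact implementation released at https://github.com/burchim/AVEC.

```python
import torch
import torch.nn as nn

class InterCTCResidual(nn.Module):
    """Sketch of an Inter CTC residual module (assumed layout, not the official code).

    Intermediate encoder features are projected to vocabulary logits for an
    auxiliary CTC loss, and the resulting posteriors are projected back to the
    model dimension and added to the features, conditioning later blocks on the
    early prediction and relaxing CTC's conditional independence assumption.
    """

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.to_vocab = nn.Linear(d_model, vocab_size)  # frame-level logits
        self.to_model = nn.Linear(vocab_size, d_model)  # fold posteriors back in

    def forward(self, x: torch.Tensor):
        # x: (batch, time, d_model) features from an intermediate encoder block
        logits = self.to_vocab(x)
        log_probs = logits.log_softmax(dim=-1)  # consumed by the intermediate CTC loss
        x = x + self.to_model(log_probs.exp())  # residual conditioning on posteriors
        return x, log_probs
```

During training, each intermediate log_probs tensor would be passed to a CTC criterion (e.g. torch.nn.CTCLoss, which expects (time, batch, vocab) inputs) alongside the final encoder output, with the auxiliary terms weighted against the main CTC loss.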
Related papers
- CR-CTC: Consistency regularization on CTC for improved speech recognition [18.996929774821822]
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR).
However, it often falls short in recognition performance compared to transducer models or to systems combining CTC with an attention-based encoder-decoder (CTC/AED).
We propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram.
arXiv Detail & Related papers (2024-10-07T14:56:07Z)
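As a rough illustration of the CR-CTC idea summarized above, the sketch below computes a symmetric KL consistency term between the frame-level CTC posteriors of two augmented views of the same utterance; the exact regularizer, augmentation, and weighting used in the paper may differ, and the weight alpha is an assumed hyper-parameter.

```python
import torch
import torch.nn.functional as F

def ctc_consistency_loss(log_probs_a: torch.Tensor, log_probs_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between frame-level CTC posteriors of two augmented views.

    Both inputs are (time, batch, vocab) log-softmax outputs of the same model
    applied to two different augmentations (e.g. SpecAugment) of one utterance.
    """
    kl_ab = F.kl_div(log_probs_a, log_probs_b, reduction="batchmean", log_target=True)
    kl_ba = F.kl_div(log_probs_b, log_probs_a, reduction="batchmean", log_target=True)
    return 0.5 * (kl_ab + kl_ba)

# Assumed overall objective: average the CTC loss over both views and add the
# weighted consistency term.
# loss = 0.5 * (ctc_loss_a + ctc_loss_b) + alpha * ctc_consistency_loss(lp_a, lp_b)
```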
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26% WER.
We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric between the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
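The hybrid CTC/Attention objective used by such models is commonly written as a weighted sum of a CTC loss on the encoder outputs and a cross-entropy loss on the attention decoder outputs. The sketch below shows that standard formulation; ctc_weight is an assumed hyper-parameter, not a value taken from the paper.

```python
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss(ignore_index=-100)

def hybrid_ctc_attention_loss(enc_log_probs, enc_lens, dec_logits,
                              ctc_targets, ctc_target_lens, ce_targets,
                              ctc_weight: float = 0.3):
    """Weighted sum of CTC and attention (cross-entropy) losses.

    enc_log_probs: (time, batch, vocab) log-softmax encoder outputs for CTC.
    dec_logits:    (batch, seq, vocab) attention decoder outputs.
    ce_targets:    (batch, seq) token ids, padded with -100.
    """
    loss_ctc = ctc_criterion(enc_log_probs, ctc_targets, enc_lens, ctc_target_lens)
    loss_att = att_criterion(dec_logits.transpose(1, 2), ce_targets)
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```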
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)