Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
- URL: http://arxiv.org/abs/2303.14307v3
- Date: Wed, 28 Jun 2023 14:41:17 GMT
- Title: Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
- Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie
Chen, Stavros Petridis, Maja Pantic
- Abstract summary: We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
- Score: 100.43280310123784
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-visual speech recognition has received a lot of attention due to its
robustness against acoustic noise. Recently, the performance of automatic,
visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR,
respectively) has been substantially improved, mainly due to the use of larger
models and training sets. However, accurate labelling of datasets is
time-consuming and expensive. Hence, in this work, we investigate the use of
automatically-generated transcriptions of unlabelled datasets to increase the
training set size. For this purpose, we use publicly-available pre-trained ASR
models to automatically transcribe unlabelled datasets such as AVSpeech and
VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training
set, which consists of the LRS2 and LRS3 datasets as well as the additional
automatically-transcribed data. We demonstrate that increasing the size of the
training set, a recent trend in the literature, leads to reduced WER despite
using noisy transcriptions. The proposed model achieves new state-of-the-art
performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of
0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art
approach, and outperforms methods that have been trained on non-publicly
available datasets with 26 times more training data.
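The core recipe the abstract describes is straightforward: transcribe the unlabelled corpora (AVSpeech, VoxCeleb2) with publicly available pre-trained ASR models, then mix the pseudo-labelled clips with the human-labelled LRS2/LRS3 data and train on the union. A minimal sketch of that pipeline is given below, assuming a Whisper-style model served through the Hugging Face `transformers` pipeline and a simple tab-separated manifest format; the helper names, paths and text normalisation are illustrative, not the paper's released code.

```python
# Minimal sketch of the auto-labelling idea: transcribe unlabelled clips with a
# publicly available pre-trained ASR model, then merge the pseudo-labelled data
# with the human-labelled LRS2/LRS3 manifests. Model choice, file layout and
# helper names are illustrative assumptions, not the paper's exact setup.
from pathlib import Path
from transformers import pipeline


def build_pseudo_labels(unlabelled_dir: str, out_manifest: str,
                        model_name: str = "openai/whisper-small") -> None:
    """Transcribe every .wav clip under `unlabelled_dir` and write a
    tab-separated manifest of (audio_path, pseudo_transcript) pairs."""
    asr = pipeline("automatic-speech-recognition", model=model_name)
    with open(out_manifest, "w", encoding="utf-8") as f:
        for wav in sorted(Path(unlabelled_dir).rglob("*.wav")):
            # crude normalisation only; real pipelines clean punctuation etc.
            text = asr(str(wav))["text"].strip().upper()
            if text:  # skip clips the ASR model produced nothing for
                f.write(f"{wav}\t{text}\n")


def merge_manifests(labelled: list[str], pseudo: list[str], out: str) -> None:
    """Concatenate human-labelled and automatically-labelled manifests into the
    augmented training set used to train the ASR/VSR/AV-ASR models."""
    with open(out, "w", encoding="utf-8") as f:
        for manifest in labelled + pseudo:
            f.write(Path(manifest).read_text(encoding="utf-8"))


if __name__ == "__main__":
    build_pseudo_labels("data/voxceleb2/audio", "manifests/voxceleb2_auto.tsv")
    merge_manifests(["manifests/lrs2.tsv", "manifests/lrs3.tsv"],
                    ["manifests/voxceleb2_auto.tsv"],
                    "manifests/augmented_train.tsv")
```

In practice the pseudo-labelled clips would also be filtered (for example by language or transcription confidence) before merging, but the sketch captures the augmentation step the abstract refers to.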
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures [19.823015917720284]
We evaluate the utility of synthetic data for training automatic speech recognition.
We use TTS models to reproduce the original training data and train ASR systems solely on the synthetic data.
We show that the TTS models generalize well, even when training scores indicate overfitting.
arXiv Detail & Related papers (2024-07-25T12:44:45Z)
- BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z)
- LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data [9.049193356646635]
Our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks (a toy sketch of this kind of feature-level distillation follows after this list).
Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware.
arXiv Detail & Related papers (2023-12-15T12:04:24Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a 26% WER.
We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that the added modules can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification-based architecture by processing both audio and visual modalities.
Our experiments show that using both modalities improves speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
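Several of the entries above learn visual speech representations from an existing audio ASR model rather than from transcripts; the LiteVSR entry in particular distills a frozen Conformer ASR encoder into a visual encoder. Below is a toy sketch of that kind of feature-level distillation, written in PyTorch with placeholder encoders, dimensions and loss; it is an assumed illustration of the general technique, not taken from any of the papers' released code.

```python
# Toy sketch of feature-level knowledge distillation in the spirit of LiteVSR:
# a visual (lip-reading) encoder is trained to regress the hidden features of a
# frozen, pre-trained audio ASR encoder on the same utterances. Architectures,
# feature dimension and loss are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 256  # assumed shared feature dimension of teacher and student


class VideoEncoder(nn.Module):
    """Tiny stand-in for a lip-reading encoder (frames -> frame features)."""
    def __init__(self, in_dim: int = 96 * 96, dim: int = FEAT_DIM):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, height*width) -> (batch, time, FEAT_DIM)
        return self.proj(frames)


def distillation_step(student: nn.Module, teacher: nn.Module,
                      frames: torch.Tensor, audio_feats: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One training step: match student visual features to frozen teacher audio features."""
    with torch.no_grad():
        target = teacher(audio_feats)           # frozen ASR encoder features
    pred = student(frames)
    loss = nn.functional.l1_loss(pred, target)  # simple regression-style distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = VideoEncoder()
    teacher = nn.Linear(80, FEAT_DIM)           # placeholder for a frozen pre-trained ASR encoder
    for p in teacher.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    frames = torch.randn(2, 50, 96 * 96)        # dummy video frames (batch, time, pixels)
    audio = torch.randn(2, 50, 80)              # dummy log-mel features aligned to the frames
    print(distillation_step(student, teacher, frames, audio, opt))
```

A real setup would replace VideoEncoder and the placeholder teacher with an actual lip-reading front-end and the frozen pre-trained ASR encoder, and would align audio and video features in time before computing the loss.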