BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
- URL: http://arxiv.org/abs/2404.02098v1
- Date: Tue, 2 Apr 2024 16:48:20 GMT
- Title: BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
- Authors: Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic
- Abstract summary: We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
- Score: 72.51848069125822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
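The 20.0% / 1.7% figures quoted above are word error rates (WER) for visual and auditory speech recognition respectively. For reference, the following is a minimal, generic sketch (plain Python, not tied to the paper's code) of how WER is typically computed as a word-level edit distance between a reference transcript and a hypothesis.

```python
# Minimal word error rate (WER) computation: edit distance over words,
# normalised by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,               # substitution / match
                           dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1)  # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one inserted word over a five-word reference -> 0.2 (20% WER).
print(wer("please do not feed gulls", "please do not feed the gulls"))
```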
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples (a generic pseudo-labelling sketch is given after this list).
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small labelled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
arXiv Detail & Related papers (2024-06-14T16:56:40Z)
- LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data [9.049193356646635]
Our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks.
Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware.
arXiv Detail & Related papers (2023-12-15T12:04:24Z)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision [60.54020550732634]
We study the potential of leveraging synthetic visual data for visual speech recognition (VSR).
The key idea is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech.
We evaluate the performance of our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3).
arXiv Detail & Related papers (2023-03-30T07:43:27Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric w.r.t. the two modalities' pretext tasks, driven by the inherent differences between video and audio (a minimal sketch of this style of cross-modal objective follows this list).
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
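As referenced in the RAVEn entry above, RAVEn (and BRAVEn, the paper summarised at the top of this page) pre-train by having the student encoder of one modality predict targets produced by a momentum (EMA) teacher encoder of the other modality. The sketch below is a simplified, symmetric rendering of that idea in PyTorch-style Python; the encoder and predictor modules are placeholders, and the actual methods use an asymmetric design and further components (e.g. input masking) that are not shown here.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    """Momentum (EMA) update of teacher parameters from the student."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

def cross_modal_step(video_student, audio_student,
                     video_teacher, audio_teacher,
                     video_predictor, audio_predictor,
                     video, audio):
    """One simplified pretraining step: each student predicts the other
    modality's teacher targets (a generic stand-in for the RAVEn-style loss)."""
    with torch.no_grad():
        video_targets = video_teacher(video)   # (B, T, D)
        audio_targets = audio_teacher(audio)   # (B, T, D)

    # Video student predicts the audio teacher's targets, and vice versa.
    video_pred = video_predictor(video_student(video))
    audio_pred = audio_predictor(audio_student(audio))

    loss = (1 - F.cosine_similarity(video_pred, audio_targets, dim=-1)).mean() \
         + (1 - F.cosine_similarity(audio_pred, video_targets, dim=-1)).mean()

    loss.backward()
    # After the optimiser step, the teachers track the students via EMA:
    # ema_update(video_teacher, video_student); ema_update(audio_teacher, audio_student)
    return loss
```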
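The Unified Speech Recognition and Auto-AVSR entries above both enlarge the training set by transcribing unlabelled audio with an existing recogniser. Below is a generic, hypothetical illustration of such a pseudo-labelling loop; the `transcribe` interface and the confidence threshold are assumptions made for illustration, not the authors' implementations.

```python
# Generic pseudo-labelling loop (illustrative; the ASR interface below is hypothetical).
from dataclasses import dataclass

@dataclass
class PseudoLabel:
    audio_path: str
    transcript: str
    confidence: float  # e.g. mean per-token probability from greedy decoding

def pseudo_label(asr_model, unlabelled_audio_paths, min_confidence=0.9):
    """Transcribe unlabelled clips with a trained ASR model and keep confident ones."""
    kept = []
    for path in unlabelled_audio_paths:
        # `transcribe` is an assumed method returning (text, confidence);
        # with greedy decoding, confidence can be the average token probability.
        text, confidence = asr_model.transcribe(path)
        if confidence >= min_confidence:
            kept.append(PseudoLabel(path, text, confidence))
    return kept

# The returned pseudo-labels are then mixed with the labelled set
# and the recogniser is retrained on the combined data.
```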
This list is automatically generated from the titles and abstracts of the papers on this site.