Robust Self-Supervised Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2201.01763v1
- Date: Wed, 5 Jan 2022 18:50:50 GMT
- Title: Robust Self-Supervised Audio-Visual Speech Recognition
- Authors: Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed
- Abstract summary: We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT)
On the largest available AVSR benchmark dataset, LRS3, our approach outperforms the prior state of the art by ~50% relative WER (28.0% vs. 14.1%) while using less than 10% of the labeled data.
Our approach also reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
- Score: 29.526786921769613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-based automatic speech recognition (ASR) degrades significantly in
noisy environments and is particularly vulnerable to interfering speech, as the
model cannot determine which speaker to transcribe. Audio-visual speech
recognition (AVSR) systems improve robustness by complementing the audio stream
with the visual information that is invariant to noise and helps the model
focus on the desired speaker. However, previous AVSR work focused solely on the
supervised learning setup; hence the progress was hindered by the amount of
labeled data available. In this work, we present a self-supervised AVSR
framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art
audio-visual speech representation learning model. On the largest available
AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by
~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in
the presence of babble noise, while reducing the WER of an audio-based model by
over 75% (25.8% vs. 5.8%) on average.
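The key robustness ingredient here is training with corrupted audio while the lip stream stays clean. Below is a minimal sketch, assuming PyTorch, of babble-noise mixing at a random SNR; `augment_batch`, `babble_pool`, and the SNR range are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of babble-noise augmentation of the kind a noise-robust
# AVSR pipeline relies on. Only the audio stream is corrupted -- the
# visual stream stays clean, which is what lets the model fall back on lips.
import torch

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
    # Loop/trim the noise to match the utterance length.
    if noise.numel() < speech.numel():
        noise = noise.repeat(speech.numel() // noise.numel() + 1)
    noise = noise[: speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment_batch(audio, video, babble_pool, snr_range=(-5.0, 5.0)):
    """Corrupt only the audio stream; video frames pass through untouched."""
    lo, hi = snr_range
    noisy = []
    for wav in audio:
        noise = babble_pool[torch.randint(len(babble_pool), (1,)).item()]
        snr = torch.empty(1).uniform_(lo, hi).item()
        noisy.append(mix_at_snr(wav, noise, snr))
    return torch.stack(noisy), video
```

Because only the audio stream is degraded, the model learns to lean on the visual stream exactly when the audio becomes unreliable.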
Related papers
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
- AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
We introduce AV2Wav, a re-synthesis-based audio-visual speech enhancement approach.
We use continuous rather than discrete representations to retain prosody and speaker information.
Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test.
arXiv Detail & Related papers (2023-09-14T21:07:53Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that the added lightweight modules can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
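The Auto-AVSR entry above rests on pseudo-labelling. A toy version of that loop might look like the following; the `asr_model.transcribe` API and the confidence filter are hypothetical, introduced only to show the flow, not the paper's pipeline.

```python
# Illustrative pseudo-labelling loop: a pre-trained ASR model transcribes
# unlabelled clips, and confident hypotheses join the AVSR training pool.
def build_auto_labels(asr_model, unlabeled_clips, min_confidence=0.9):
    augmented = []
    for clip in unlabeled_clips:
        hyp = asr_model.transcribe(clip.audio)     # hypothetical API
        if hyp.confidence >= min_confidence:       # drop unreliable labels
            augmented.append((clip.audio, clip.video, hyp.text))
    return augmented
```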
- Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT [37.343431783936126]
This paper investigates self-supervised pre-training for audio-visual speaker representation learning.
A visual stream showing the speaker's mouth area is used alongside the speech audio as input.
We conducted extensive experiments probing the effectiveness of pre-training and visual modality.
arXiv Detail & Related papers (2022-05-15T04:48:41Z)
- Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction [26.27172574676212]
Video recordings of speech contain correlated audio and visual information.
We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech.
AV-HuBERT learns powerful audio-visual speech representations that benefit both lip-reading and automatic speech recognition.
arXiv Detail & Related papers (2022-01-05T17:40:45Z)
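For intuition about the masked multimodal cluster prediction objective described in the AV-HuBERT entry above, here is a minimal sketch assuming PyTorch: audio and lip features are fused, some frames are masked, and frame-level cluster IDs (e.g. from offline k-means) are predicted only at the masked positions. All shapes and modules are illustrative, not the released implementation.

```python
# Sketch of masked cluster prediction over fused audio-visual features.
import torch
import torch.nn as nn

class MaskedClusterPredictor(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, hidden=256, n_clusters=500):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + video_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, n_clusters)
        self.mask_emb = nn.Parameter(torch.zeros(hidden))  # learned [MASK] embedding

    def forward(self, audio, video, cluster_ids, mask):
        # audio: (B, T, A), video: (B, T, V), cluster_ids: (B, T) long,
        # mask: (B, T) bool -- True where the input is hidden from the model.
        x = self.fuse(torch.cat([audio, video], dim=-1))
        x = torch.where(mask.unsqueeze(-1), self.mask_emb, x)
        logits = self.head(self.encoder(x))
        # Loss only on masked frames, as in masked prediction pre-training.
        return nn.functional.cross_entropy(logits[mask], cluster_ids[mask])
```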
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of self-supervised speech representation models.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
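Audio ALBERT's parameter saving comes from ALBERT-style cross-layer weight sharing: one transformer layer's weights are reused at every depth. A minimal sketch of that trick, assuming PyTorch (dimensions illustrative):

```python
# One layer instance applied `depth` times: depth without depth-many parameters.
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, dim=768, nhead=12, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):  # same weights at every level
            x = self.layer(x)
        return x
```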
This list is automatically generated from the titles and abstracts of the papers in this site.
This site makes no guarantees about the quality of the information presented and is not responsible for any consequences of its use.