Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
- URL: http://arxiv.org/abs/2201.02184v1
- Date: Wed, 5 Jan 2022 17:40:45 GMT
- Title: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
- Authors: Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed
- Abstract summary: Video recordings of speech contain correlated audio and visual information.
We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech.
AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition.
- Score: 26.27172574676212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video recordings of speech contain correlated audio and visual information,
providing a strong signal for speech representation learning from the speaker's
lip movements and the produced sound. We introduce Audio-Visual Hidden Unit
BERT (AV-HuBERT), a self-supervised representation learning framework for
audio-visual speech, which masks multi-stream video input and predicts
automatically discovered and iteratively refined multimodal hidden units.
AV-HuBERT learns powerful audio-visual speech representation benefiting both
lip-reading and automatic speech recognition. On the largest public lip-reading
benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of
labeled data, outperforming the former state-of-the-art approach (33.6%)
trained with a thousand times more transcribed video data (31K hours). The
lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled
data from LRS3 and combined with self-training. Using our audio-visual
representation on the same benchmark for audio-only speech recognition leads to
a 40% relative WER reduction over the state-of-the-art performance (1.3% vs
2.3%). Our code and models are available at
https://github.com/facebookresearch/av_hubert
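To make the training objective above concrete, here is a minimal sketch of masked multimodal cluster prediction: both streams are masked, fused, encoded with a Transformer, and cluster IDs are predicted at masked frames. The feature dimensions, the fusion rule, and the random cluster targets below are illustrative placeholders, not the released AV-HuBERT implementation (which, for example, uses a ResNet lip-ROI frontend).
```python
# Minimal sketch of masked multimodal cluster prediction (AV-HuBERT-style).
# All sizes and the toy cluster targets are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedAVClusterPredictor(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, d_model=256,
                 num_clusters=500, num_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # per-stream frontends
        self.video_proj = nn.Linear(video_dim, d_model)
        # Learned embeddings that replace masked frames in each stream.
        self.audio_mask_emb = nn.Parameter(torch.zeros(d_model))
        self.video_mask_emb = nn.Parameter(torch.zeros(d_model))
        self.fuse = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_clusters)

    def forward(self, audio_feats, video_feats, audio_mask, video_mask, targets):
        # audio_feats: (B, T, audio_dim); video_feats: (B, T, video_dim)
        # audio_mask / video_mask: (B, T) bools marking masked frames
        # targets: (B, T) cluster IDs (e.g. k-means over acoustic features,
        # refined iteratively from the model's own representations)
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        a = torch.where(audio_mask.unsqueeze(-1), self.audio_mask_emb.expand_as(a), a)
        v = torch.where(video_mask.unsqueeze(-1), self.video_mask_emb.expand_as(v), v)
        x = self.encoder(self.fuse(torch.cat([a, v], dim=-1)))
        logits = self.head(x)                               # (B, T, num_clusters)
        loss_frames = audio_mask | video_mask               # score only masked frames
        return F.cross_entropy(logits[loss_frames], targets[loss_frames])


# Toy usage with random tensors standing in for filterbank and lip-ROI features.
B, T = 2, 50
model = MaskedAVClusterPredictor()
loss = model(
    torch.randn(B, T, 104), torch.randn(B, T, 512),
    torch.rand(B, T) < 0.3, torch.rand(B, T) < 0.3,
    torch.randint(0, 500, (B, T)),
)
loss.backward()
```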
Related papers
- A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition [0.0]
Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication.
We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment to train a multimodal model with a shared latent representation.
To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER.
arXiv Detail & Related papers (2024-03-02T21:15:24Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Robust Self-Supervised Audio-Visual Speech Recognition [29.526786921769613]
We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT).
On the largest available AVSR benchmark, LRS3, our approach outperforms the prior state of the art by about 50% relative (28.0% vs. 14.1% WER) while using less than 10% of the labeled data.
Our approach reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
arXiv Detail & Related papers (2022-01-05T18:50:50Z)
- Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval [13.40010612226968]
Speech-based image retrieval has been studied as a proxy for joint representation learning.
It is unclear how well speech-based retrieval can work in practice.
We show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
arXiv Detail & Related papers (2021-04-05T13:11:40Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
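As a complement to the last entry, here is a minimal sketch of the visually guided, audio-to-video idea: an audio encoder drives a toy generator that animates a still face image, and an L1 loss pulls the generated clip toward the real video, so the gradients shape the audio representation. All module names, shapes, and the simplistic generator are hypothetical, not the authors' architecture.
```python
# Minimal sketch of visually guided audio self-supervision via audio-to-video
# generation. Module names, shapes, and the toy generator are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Maps a raw waveform to a sequence of audio embeddings (the representation we keep)."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=400, stride=160),  # ~10 ms hop at 16 kHz
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )

    def forward(self, wav):                                  # wav: (B, samples)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)   # (B, T, d_model)


class FrameGenerator(nn.Module):
    """Animates a still face image conditioned on one audio embedding per frame (toy stand-in)."""
    def __init__(self, d_model=128):
        super().__init__()
        self.modulate = nn.Linear(d_model, 3)                # per-channel modulation

    def forward(self, still, audio_emb):                     # still: (B, 3, H, W)
        gain = self.modulate(audio_emb)                      # (B, T, 3)
        # Broadcast the still image over time and modulate it with the audio embedding.
        return still.unsqueeze(1) * (1 + gain.unsqueeze(-1).unsqueeze(-1))


encoder, generator = AudioEncoder(), FrameGenerator()
wav = torch.randn(2, 16000)                                  # 1 s of audio at 16 kHz
still = torch.randn(2, 3, 64, 64)                            # a still frame of the speaker
audio_emb = encoder(wav)                                     # (B, T, 128)
real_video = torch.randn(2, audio_emb.shape[1], 3, 64, 64)   # real frames aligned to T
fake_video = generator(still, audio_emb)                     # (B, T, 3, 64, 64)
loss = F.l1_loss(fake_video, real_video)     # push the generated clip toward the real one
loss.backward()                              # gradients flow back into the audio encoder
```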