Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
- URL: http://arxiv.org/abs/2205.07180v1
- Date: Sun, 15 May 2022 04:48:41 GMT
- Title: Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
- Authors: Bowen Shi and Abdelrahman Mohamed and Wei-Ning Hsu
- Abstract summary: This paper investigates self-supervised pre-training for audio-visual speaker representation learning.
A visual stream showing the speaker's mouth area is used alongside speech as inputs.
We conducted extensive experiments probing the effectiveness of pre-training and visual modality.
- Score: 37.343431783936126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates self-supervised pre-training for audio-visual speaker
representation learning where a visual stream showing the speaker's mouth area
is used alongside speech as inputs. Our study focuses on the Audio-Visual
Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose
audio-visual speech pre-training framework. We conducted extensive experiments
probing the effectiveness of pre-training and visual modality. Experimental
results suggest that AV-HuBERT generalizes decently to speaker-related
downstream tasks, improving label efficiency by roughly tenfold for both
audio-only and audio-visual speaker verification. We also show that
incorporating visual information, even just the lip area, greatly improves
performance and noise robustness, reducing EER by 38% in the clean condition
and 75% in noisy conditions. Our code and models will be publicly available.
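For context on the EER numbers above, here is a minimal sketch of a speaker verification scoring step: trials are scored by cosine similarity between fixed-dimensional speaker embeddings (mean-pooled AV-HuBERT features are one plausible choice), and the equal error rate is the threshold-sweep point where false accepts and false rejects balance. The function names and toy data below are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch (not the paper's released code): score speaker-verification
# trials by cosine similarity and compute the equal error rate (EER).
# Any fixed-dimensional embeddings work; mean-pooled AV-HuBERT features
# are one possibility. Function names and toy data are illustrative.
import numpy as np

def cosine_scores(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Cosine similarity for each enrollment/test pair (paired rows)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where false-accept rate equals false-reject rate."""
    order = np.argsort(-scores)                # sweep threshold high -> low
    is_target = labels[order].astype(bool)
    n_tgt, n_imp = is_target.sum(), (~is_target).sum()
    far = np.cumsum(~is_target) / n_imp        # impostors accepted so far
    frr = 1.0 - np.cumsum(is_target) / n_tgt   # targets still rejected
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2.0)

# Toy trial list: two target pairs (same speaker) and two impostor pairs.
rng = np.random.default_rng(0)
enroll = rng.normal(size=(4, 256))
test = enroll.copy()
test[:2] += 0.1 * rng.normal(size=(2, 256))    # targets: near-duplicates
test[2:] = rng.normal(size=(2, 256))           # impostors: unrelated
labels = np.array([1, 1, 0, 0])
print(f"EER: {equal_error_rate(cosine_scores(enroll, test), labels):.1%}")
```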
Related papers
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Robust Self-Supervised Audio-Visual Speech Recognition [29.526786921769613]
We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT).
On LRS3, the largest available AVSR benchmark dataset, our approach outperforms the prior state of the art by roughly 50% relative (28.0% vs. 14.1% WER) using less than 10% of the labeled data.
Our approach also reduces the WER of an audio-based model by over 75% relative on average (25.8% vs. 5.8%; both reductions are checked in the sketch after this list).
arXiv Detail & Related papers (2022-01-05T18:50:50Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which sets a new state of the art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
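The relative reductions quoted in these summaries (the WER drops in the Robust Self-Supervised AVSR entry above, and the EER drops in the main abstract) all follow the same (baseline - improved) / baseline arithmetic. A quick sanity-check sketch, assuming the quoted percentages are relative rather than absolute reductions:

```python
# Sanity check of the relative reductions quoted above, assuming each quoted
# percentage is a relative (baseline - improved) / baseline reduction.
def relative_reduction(baseline: float, improved: float) -> float:
    return (baseline - improved) / baseline

print(f"{relative_reduction(28.0, 14.1):.1%}")  # 49.6% -> "by roughly 50%" (LRS3 WER)
print(f"{relative_reduction(25.8, 5.8):.1%}")   # 77.5% -> "over 75%" (audio-based WER)
```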