AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR
- URL: http://arxiv.org/abs/2303.16501v1
- Date: Wed, 29 Mar 2023 07:24:28 GMT
- Title: AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR
- Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
- Abstract summary: We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that the added lightweight adaptors can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively.
- Score: 79.21857972093332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audiovisual automatic speech recognition (AV-ASR) aims to improve the
robustness of a speech recognition system by incorporating visual information.
Training fully supervised multimodal models for this task from scratch, however,
is limited by the need for large labelled audiovisual datasets (in each
downstream domain of interest). We present AVFormer, a simple method for
augmenting audio-only models with visual information while simultaneously
performing lightweight domain adaptation. We do this by (i) injecting visual
embeddings into a frozen ASR model using lightweight trainable adaptors, which
we show can be trained on a small amount of weakly labelled video data with
minimal additional training time and parameters; (ii) introducing a simple
curriculum scheme during training, which we show is crucial for enabling the
model to jointly process audio and visual information effectively; and finally
(iii) demonstrating that our model achieves state-of-the-art zero-shot results on
three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also
crucially preserving decent performance on traditional audio-only speech
recognition benchmarks (LibriSpeech). Qualitative results show that our model
effectively leverages visual information for robust speech recognition.
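To make the description above concrete, here is a minimal PyTorch-style sketch of the two trainable pieces the abstract mentions: a projection of visual embeddings into the frozen ASR model's representation space, and lightweight bottleneck adaptors wrapped around the frozen encoder layers. The class names, dimensions, adaptor placement, and the assumption of CLIP-style frame features are illustrative guesses, not the paper's exact implementation.

```python
# Illustrative sketch only: the real AVFormer architecture and hyperparameters
# are described in the paper; names and shapes below are assumptions.
from typing import Optional

import torch
import torch.nn as nn


class BottleneckAdaptor(nn.Module):
    """Small trainable residual adaptor attached to a frozen encoder layer."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))


class AVFormerSketch(nn.Module):
    def __init__(self, frozen_asr_layers: nn.ModuleList, dim: int,
                 visual_dim: int = 512, num_visual_tokens: int = 4):
        super().__init__()
        # Keep the pretrained audio-only ASR encoder frozen.
        self.asr_layers = frozen_asr_layers
        for p in self.asr_layers.parameters():
            p.requires_grad = False
        # Trainable parts: a visual-to-ASR projection and one adaptor per layer.
        self.visual_proj = nn.Linear(visual_dim, dim)
        self.num_visual_tokens = num_visual_tokens
        self.adaptors = nn.ModuleList(
            [BottleneckAdaptor(dim) for _ in range(len(frozen_asr_layers))]
        )

    def forward(self, audio_feats: torch.Tensor,
                visual_feats: Optional[torch.Tensor] = None) -> torch.Tensor:
        # audio_feats: (B, T, dim) features from the frozen audio front end.
        # visual_feats: (B, N, visual_dim), e.g. CLIP-style frame embeddings (assumed).
        x = audio_feats
        if visual_feats is not None:
            vis = self.visual_proj(visual_feats[:, : self.num_visual_tokens])
            # Prepend projected visual tokens to the audio token sequence.
            x = torch.cat([vis, x], dim=1)
        for layer, adaptor in zip(self.asr_layers, self.adaptors):
            x = adaptor(layer(x))
        return x
```

One way to read the curriculum mentioned above (an assumption here, since the abstract does not spell out the schedule) is a two-stage recipe: first train the adaptors with visual_feats set to None so the frozen model adapts to the new domain on audio alone, and only then enable the visual projection so audio and visual inputs are processed jointly.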
Related papers
- CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech.
We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models.
Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts for audiovisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on processing discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms models that use external supervision on these benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
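As a rough illustration of the masked prediction objective over unified tokens mentioned in the VATLM entry above, the sketch below masks a fraction of token ids drawn from a shared vocabulary and trains a modality-agnostic encoder to recover them. The tokenizer, backbone, and shapes are placeholders, not VATLM's actual components.

```python
# Rough sketch of a masked-prediction objective over "unified tokens", in the
# spirit of the VATLM summary above. The shared vocabulary, backbone, and head
# are assumed placeholders, not the paper's actual components.
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_unified_token_loss(backbone: nn.Module,
                              embed: nn.Embedding,
                              head: nn.Linear,
                              tokens: torch.Tensor,
                              mask_token_id: int,
                              mask_prob: float = 0.15) -> torch.Tensor:
    """tokens: (B, T) ids in a shared vocabulary covering visual/audio/text units."""
    # Randomly select positions to mask, then corrupt them with a [MASK] id.
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    corrupted = tokens.masked_fill(mask, mask_token_id)
    hidden = backbone(embed(corrupted))  # (B, T, D) modality-agnostic encoder
    logits = head(hidden)                # (B, T, vocab_size)
    # Predict the original token ids only at masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```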