Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
- URL: http://arxiv.org/abs/2409.12370v1
- Date: Thu, 19 Sep 2024 00:08:28 GMT
- Title: Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
- Authors: Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe,
- Abstract summary: We introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for in-the-wild'' videos.
We first encode visual information into visual tokens sequence and map them into speech space by a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
- Score: 67.75334989582709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Given the complexity of visual signals, an audiovisual speech recognition model requires robust generalization capabilities across diverse video scenarios, presenting a significant challenge. In this paper, we introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for ``in-the-wild'' videos. Specifically, we first encode visual information into visual tokens sequence and map them into speech space by a lightweight projection. Then, we build EVA upon a robust pretrained speech recognition model, ensuring its generalization ability. Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module. Experiments show our model achieves state-of-the-art results on three benchmarks, which demonstrates the generalization ability of EVA across diverse video domains.
Related papers
- CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech.
We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models.
Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z) - VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z) - Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z) - Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE)
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z) - A multimodal dynamical variational autoencoder for audiovisual speech
representation learning [23.748108659645844]
multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning.
Experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition.
arXiv Detail & Related papers (2023-05-05T14:37:26Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.