Multimodal Speech Recognition with Unstructured Audio Masking
- URL: http://arxiv.org/abs/2010.08642v1
- Date: Fri, 16 Oct 2020 21:49:20 GMT
- Title: Multimodal Speech Recognition with Unstructured Audio Masking
- Authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott
- Abstract summary: We simulate a more realistic masking scenario during model training, called RandWordMask.
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words.
Our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted.
- Score: 49.01826387664443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual context has been shown to be useful for automatic speech recognition
(ASR) systems when the speech signal is noisy or corrupted. Previous work,
however, has only demonstrated the utility of visual context in an unrealistic
setting, where a fixed set of words are systematically masked in the audio. In
this paper, we simulate a more realistic masking scenario during model
training, called RandWordMask, where the masking can occur for any word
segment. Our experiments on the Flickr 8K Audio Captions Corpus show that
multimodal ASR can generalize to recover different types of masked words in
this unstructured masking setting. Moreover, our analysis shows that our models
are capable of attending to the visual signal when the audio signal is
corrupted. These results show that multimodal ASR systems can leverage the
visual signal in more generalized noisy scenarios.
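For illustration, the RandWordMask corruption described in the abstract can be sketched as below. This is a minimal sketch, not the authors' released code: the function name, the per-word masking probability, and the choice of silence as the masking signal are assumptions, and word boundaries are assumed to come from a forced alignment.

```python
import random

import numpy as np


def rand_word_mask(audio, word_segments, mask_prob=0.3, seed=None):
    """Corrupt an utterance by masking randomly chosen word segments.

    audio:         1-D numpy array of raw samples.
    word_segments: list of (start_sample, end_sample) spans, one per word,
                   e.g. obtained from a forced alignment.
    mask_prob:     probability that any given word segment is masked
                   (an assumed hyperparameter, not taken from the paper).
    """
    rng = random.Random(seed)
    masked = audio.copy()
    for start, end in word_segments:
        if rng.random() < mask_prob:
            # Replace the word's samples with silence; the exact masking
            # signal (silence vs. noise) is an assumption here.
            masked[start:end] = 0.0
    return masked


# Usage: mask words in a 16 kHz utterance with aligned word spans.
audio = np.random.randn(48000).astype(np.float32)      # 3 s of dummy audio
segments = [(0, 8000), (8000, 20000), (20000, 33000)]  # fake alignments
corrupted = rand_word_mask(audio, segments, mask_prob=0.3, seed=0)
```

During training, such corrupted utterances would be paired with the corresponding image, so the model learns to fall back on the visual signal to recover the masked words.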
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts approach for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space via a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z) - MaskSR: Masked Language Model for Full-band Speech Restoration [7.015213589171985]
Speech restoration aims to restore high-quality speech in the presence of a diverse set of distortions.
We propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth.
arXiv Detail & Related papers (2024-06-04T08:23:57Z) - EnCodecMAE: Leveraging neural codecs for universal audio representation learning [16.590638305972632]
We propose masking representations of the audio signal and training a masked autoencoder (MAE) to reconstruct the masked segments.
We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds.
arXiv Detail & Related papers (2023-09-14T02:21:53Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
To integrate the three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that exploits finer-grained visual information from different parts of the image, obtained via automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
arXiv Detail & Related papers (2020-10-05T23:06:24Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system, with absolute word error rate (WER) reductions of up to 6.81% (26.83% relative) and 22.22% (56.87% relative) on overlapped speech constructed by simulation or by replaying the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z) - Looking Enhances Listening: Recovering Missing Speech Using Images [40.616935661628155]
We present a set of experiments where we show the utility of the visual modality under noisy conditions.
Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations.
arXiv Detail & Related papers (2020-02-13T17:12:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.