Looking Enhances Listening: Recovering Missing Speech Using Images
- URL: http://arxiv.org/abs/2002.05639v1
- Date: Thu, 13 Feb 2020 17:12:51 GMT
- Title: Looking Enhances Listening: Recovering Missing Speech Using Images
- Authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze
- Abstract summary: We present a set of experiments where we show the utility of the visual modality under noisy conditions.
Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations.
- Score: 40.616935661628155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech is understood better by using visual context; for this reason, there
have been many attempts to use images to adapt automatic speech recognition
(ASR) systems. Current work, however, has shown that visually adapted ASR
models only use images as a regularization signal, while completely ignoring
their semantic content. In this paper, we present a set of experiments where we
show the utility of the visual modality under noisy conditions. Our results
show that multimodal ASR models can recover words that are masked in the input
acoustic signal by grounding their transcriptions in the visual
representations. We observe that integrating visual context can result in up to
35% relative improvement in masked word recovery. These results demonstrate
that end-to-end multimodal ASR systems can become more robust to noise by
leveraging the visual context.
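The recovery mechanism described above can be pictured with a minimal sketch: word-aligned spans of the acoustic input are masked, and the model grounds its transcription by attending over image features. Everything below (module names, dimensions, the GRU encoder, the attention-based fusion) is an illustrative assumption for exposition, not the architecture used in the paper.
```python
# A minimal sketch (assumed architecture, NOT the paper's exact model) of the idea:
# word-aligned spans of the acoustic input are masked, and the model grounds its
# transcription by attending over projected image features.
import torch
import torch.nn as nn

class MultimodalASRSketch(nn.Module):
    def __init__(self, n_audio_feats=80, n_visual_feats=2048, d_model=256, vocab_size=5000):
        super().__init__()
        self.audio_enc = nn.GRU(n_audio_feats, d_model, batch_first=True)
        self.visual_proj = nn.Linear(n_visual_feats, d_model)   # image features -> model space
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)                # per-frame token logits

    def forward(self, audio, visual, mask_spans):
        # audio: (B, T, n_audio_feats) filterbank frames
        # visual: (B, R, n_visual_feats) region features from an image encoder
        # mask_spans: one (start, end) frame span per utterance to corrupt
        audio = audio.clone()
        for b, (start, end) in enumerate(mask_spans):
            audio[b, start:end] = 0.0                            # simulate a masked word in the signal
        h, _ = self.audio_enc(audio)                             # (B, T, d_model)
        v = self.visual_proj(visual)                             # (B, R, d_model)
        grounded, _ = self.cross_attn(query=h, key=v, value=v)   # audio states attend to the image
        return self.out(h + grounded)                            # logits informed by both modalities

# Shapes only; random tensors stand in for real features.
model = MultimodalASRSketch()
logits = model(torch.randn(2, 120, 80),           # 2 utterances, 120 frames each
               torch.randn(2, 36, 2048),          # 36 visual regions per image
               mask_spans=[(30, 45), (60, 70)])
print(logits.shape)                               # torch.Size([2, 120, 5000])
```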
Related papers
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a visual token sequence and map it into the speech space with a lightweight projection (a minimal sketch of this projection appears after this list).
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning [25.743503223389784]
We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions.
arXiv Detail & Related papers (2022-12-10T14:01:54Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Multimodal Speech Recognition with Unstructured Audio Masking [49.01826387664443]
We simulate a more realistic masking scenario during model training, called RandWordMask.
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words.
Our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted.
arXiv Detail & Related papers (2020-10-16T21:49:20Z)
- Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that uses finer-grained visual information from different parts of the image, obtained via automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
arXiv Detail & Related papers (2020-10-05T23:06:24Z)
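The EVA entry above mentions mapping visual tokens into the speech space with a lightweight projection. The fragment below is a hedged sketch of that single step under assumed dimensions; it is not EVA's published configuration.
```python
# Hedged sketch of a "lightweight projection" from visual tokens into the speech
# embedding space; all dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

d_visual, d_speech = 1024, 512
visual_to_speech = nn.Linear(d_visual, d_speech)      # the lightweight projection

speech_seq = torch.randn(1, 200, d_speech)            # encoded audio sequence
visual_tokens = torch.randn(1, 16, d_visual)          # visual token sequence from an image encoder

projected = visual_to_speech(visual_tokens)           # visual tokens now live in the speech space
fused = torch.cat([projected, speech_seq], dim=1)     # (1, 216, d_speech), fed on to the ASR decoder
print(fused.shape)
```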
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.