Looking Enhances Listening: Recovering Missing Speech Using Images
- URL: http://arxiv.org/abs/2002.05639v1
- Date: Thu, 13 Feb 2020 17:12:51 GMT
- Title: Looking Enhances Listening: Recovering Missing Speech Using Images
- Authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze
- Abstract summary: We present a set of experiments where we show the utility of the visual modality under noisy conditions.
Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding their transcriptions using the visual representations.
- Score: 40.616935661628155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech is understood better by using visual context; for this reason, there
have been many attempts to use images to adapt automatic speech recognition
(ASR) systems. Current work, however, has shown that visually adapted ASR
models only use images as a regularization signal, while completely ignoring
their semantic content. In this paper, we present a set of experiments where we
show the utility of the visual modality under noisy conditions. Our results
show that multimodal ASR models can recover words which are masked in the input
acoustic signal, by grounding their transcriptions using the visual
representations. We observe that integrating visual context can result in up to
35% relative improvement in masked word recovery. These results demonstrate
that end-to-end multimodal ASR systems can become more robust to noise by
leveraging the visual context.
Related papers
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning [25.743503223389784]
We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions.
arXiv Detail & Related papers (2022-12-10T14:01:54Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Multimodal Speech Recognition with Unstructured Audio Masking [49.01826387664443]
We simulate a more realistic masking scenario during model training, called RandWordMask.
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words.
Our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted.
arXiv Detail & Related papers (2020-10-16T21:49:20Z)
- Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
arXiv Detail & Related papers (2020-10-05T23:06:24Z)
- Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement [94.0676772764248]
We propose a visual embedding approach to improving embedding aware speech enhancement (EASE).
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, utilizing a complementarity of audio and visual features for multi-modal EASE (MEASE).
arXiv Detail & Related papers (2020-09-21T01:26:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.