Fine-Grained Grounding for Multimodal Speech Recognition
- URL: http://arxiv.org/abs/2010.02384v1
- Date: Mon, 5 Oct 2020 23:06:24 GMT
- Title: Fine-Grained Grounding for Multimodal Speech Recognition
- Authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze and Desmond Elliott
- Abstract summary: We propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
- Score: 49.01826387664443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal automatic speech recognition systems integrate information from
images to improve speech recognition quality, by grounding the speech in the
visual context. While visual signals have been shown to be useful for
recovering entities that have been masked in the audio, these models should be
capable of recovering a broader range of word types. Existing systems rely on
global visual features that represent the entire image, but localizing the
relevant regions of the image will make it possible to recover a larger set of
words, such as adjectives and verbs. In this paper, we propose a model that
uses finer-grained visual information from different parts of the image, using
automatic object proposals. In experiments on the Flickr8K Audio Captions
Corpus, we find that our model improves over approaches that use global visual
features, that the proposals enable the model to recover entities and other
related words, such as adjectives, and that improvements are due to the model's
ability to localize the correct proposals.
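As a rough illustration of the proposal-level grounding described in the abstract, the sketch below shows one way a decoder state could attend over per-proposal visual features instead of a single global image vector. It is a minimal PyTorch sketch under assumed dimensions (36 RCNN-style proposals, 2048-d features); the module name, fusion-by-concatenation choice, and sizes are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch (not the paper's exact architecture): a decoder state
# attends over per-proposal visual features instead of one global image vector.
import torch
import torch.nn as nn

class ProposalAttentionFusion(nn.Module):
    """Fuses an ASR decoder state with object-proposal features via attention."""

    def __init__(self, decoder_dim: int, proposal_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(decoder_dim, hidden_dim)   # decoder state -> query
        self.key_proj = nn.Linear(proposal_dim, hidden_dim)    # proposal feature -> key
        self.value_proj = nn.Linear(proposal_dim, hidden_dim)  # proposal feature -> value
        self.out_proj = nn.Linear(decoder_dim + hidden_dim, decoder_dim)

    def forward(self, decoder_state, proposal_feats):
        # decoder_state: (batch, decoder_dim); proposal_feats: (batch, num_proposals, proposal_dim)
        q = self.query_proj(decoder_state).unsqueeze(1)               # (batch, 1, hidden)
        k = self.key_proj(proposal_feats)                             # (batch, P, hidden)
        v = self.value_proj(proposal_feats)                           # (batch, P, hidden)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5  # (batch, 1, P)
        weights = scores.softmax(dim=-1)                              # attention over proposals
        context = torch.bmm(weights, v).squeeze(1)                    # (batch, hidden)
        fused = self.out_proj(torch.cat([decoder_state, context], dim=-1))
        return fused, weights.squeeze(1)                              # weights localize proposals

# Toy usage: random tensors stand in for decoder states and proposal embeddings.
fusion = ProposalAttentionFusion(decoder_dim=512, proposal_dim=2048)
state = torch.randn(4, 512)           # one decoder step for a batch of 4 utterances
proposals = torch.randn(4, 36, 2048)  # 36 automatic object proposals per image
fused_state, attn = fusion(state, proposals)
print(fused_state.shape, attn.shape)  # torch.Size([4, 512]) torch.Size([4, 36])
```

The returned attention weights are the kind of per-proposal localization signal that the abstract credits for the reported improvements.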
Related papers
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
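The EVA summary above mentions mapping visual tokens into the speech space with a lightweight projection. Below is a minimal sketch of that general idea, assuming a single linear projection plus layer normalization and prepending the projected tokens to the speech feature sequence; the dimensions and concatenation scheme are assumptions for illustration, not the paper's specification.

```python
# Illustrative sketch only: project visual tokens into the speech embedding space
# and prepend them to the speech feature sequence (dimensions are assumptions).
import torch
import torch.nn as nn

class VisualToSpeechProjector(nn.Module):
    def __init__(self, visual_dim: int = 768, speech_dim: int = 512):
        super().__init__()
        # "Lightweight projection": a single linear layer plus normalization.
        self.proj = nn.Sequential(nn.Linear(visual_dim, speech_dim), nn.LayerNorm(speech_dim))

    def forward(self, visual_tokens, speech_feats):
        # visual_tokens: (batch, V, visual_dim); speech_feats: (batch, T, speech_dim)
        projected = self.proj(visual_tokens)                # (batch, V, speech_dim)
        return torch.cat([projected, speech_feats], dim=1)  # (batch, V + T, speech_dim)

projector = VisualToSpeechProjector()
vis = torch.randn(2, 16, 768)        # e.g. 16 visual tokens per image
speech = torch.randn(2, 200, 512)    # 200 speech encoder frames
print(projector(vis, speech).shape)  # torch.Size([2, 216, 512])
```

A downstream mixture-of-experts layer would then operate on this joint sequence; that routing step is omitted here.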
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach to developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
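The retrieval-augmented captioning entry above builds on a kNN memory queried by visual similarity. The sketch below illustrates only that retrieval step, assuming precomputed image embeddings and cosine similarity; the memory layout, embedding size, and k are illustrative choices rather than the paper's implementation.

```python
# Illustrative sketch: retrieve the captions of the most visually similar images
# from a kNN memory, using cosine similarity over precomputed embeddings.
import numpy as np

def retrieve_captions(query_embedding, memory_embeddings, memory_captions, k=3):
    """Return the k captions whose image embeddings are closest to the query."""
    # Normalize so that dot products equal cosine similarities.
    q = query_embedding / np.linalg.norm(query_embedding)
    m = memory_embeddings / np.linalg.norm(memory_embeddings, axis=1, keepdims=True)
    scores = m @ q                     # (memory_size,)
    top_idx = np.argsort(-scores)[:k]  # indices of the k most similar images
    return [(memory_captions[i], float(scores[i])) for i in top_idx]

# Toy memory of 5 captioned images with random 512-d embeddings.
rng = np.random.default_rng(0)
memory_emb = rng.standard_normal((5, 512))
memory_caps = [f"caption {i}" for i in range(5)]
query = rng.standard_normal(512)
print(retrieve_captions(query, memory_emb, memory_caps, k=2))
```

The retrieved captions would then be passed to the caption generator as extra context; that conditioning step is not shown.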
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on processing discretized visual speech units.
We set a new multilingual VSR state of the art, achieving performance comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Looking Enhances Listening: Recovering Missing Speech Using Images [40.616935661628155]
We present a set of experiments where we show the utility of the visual modality under noisy conditions.
Our results show that multimodal ASR models can recover words masked in the input acoustic signal by grounding their transcriptions in the visual representations.
arXiv Detail & Related papers (2020-02-13T17:12:51Z)
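The last entry evaluates whether visual grounding can recover words that are masked in the acoustic signal. As a purely illustrative sketch of such a masking setup (the word-level alignments, silence substitution, and masking probability are assumptions, not the paper's protocol), one could corrupt aligned word spans before scoring a multimodal ASR model:

```python
# Hypothetical sketch of the masking setup: silence randomly chosen word spans in
# the waveform, given word-level time alignments (all details here are assumptions).
import random
import numpy as np

def mask_words(waveform, alignments, sample_rate, mask_prob=0.3, seed=0):
    """Zero out the audio for a random subset of aligned words.

    alignments: list of (word, start_sec, end_sec) tuples.
    Returns the corrupted waveform and the list of masked words.
    """
    rng = random.Random(seed)
    corrupted = waveform.copy()
    masked = []
    for word, start, end in alignments:
        if rng.random() < mask_prob:
            lo, hi = int(start * sample_rate), int(end * sample_rate)
            corrupted[lo:hi] = 0.0  # replace the word's samples with silence
            masked.append(word)
    return corrupted, masked

# Toy example: 2 seconds of noise at 16 kHz with three fake word alignments.
audio = np.random.randn(32000).astype(np.float32)
aligns = [("a", 0.1, 0.3), ("brown", 0.4, 0.9), ("dog", 1.0, 1.5)]
corrupted, masked = mask_words(audio, aligns, sample_rate=16000)
print(masked)  # which words get silenced depends on mask_prob and the seed
```

A multimodal model would then be scored on how often the silenced words reappear in its transcriptions.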
This list is automatically generated from the titles and abstracts of the papers on this site.