Fine-Grained Grounding for Multimodal Speech Recognition
- URL: http://arxiv.org/abs/2010.02384v1
- Date: Mon, 5 Oct 2020 23:06:24 GMT
- Title: Fine-Grained Grounding for Multimodal Speech Recognition
- Authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze and Desmond Elliott
- Abstract summary: We propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
- Score: 49.01826387664443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal automatic speech recognition systems integrate information from
images to improve speech recognition quality, by grounding the speech in the
visual context. While visual signals have been shown to be useful for
recovering entities that have been masked in the audio, these models should be
capable of recovering a broader range of word types. Existing systems rely on
global visual features that represent the entire image, but localizing the
relevant regions of the image will make it possible to recover a larger set of
words, such as adjectives and verbs. In this paper, we propose a model that
uses finer-grained visual information from different parts of the image, using
automatic object proposals. In experiments on the Flickr8K Audio Captions
Corpus, we find that our model improves over approaches that use global visual
features, that the proposals enable the model to recover entities and other
related words, such as adjectives, and that improvements are due to the model's
ability to localize the correct proposals.
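As a rough illustration of the fine-grained grounding idea described in the abstract, the sketch below attends over per-proposal visual features rather than a single global image vector and appends the attended context to an ASR decoder state. This is a minimal numpy sketch, not the authors' exact architecture; the dimensions, projection matrices, and names (attend_to_proposals, proposal_feats) are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_to_proposals(decoder_state, proposal_feats, W_q, W_k, W_v):
    """Attend over object-proposal features instead of one global image vector.

    decoder_state : (d,)    current ASR decoder hidden state
    proposal_feats: (n, v)  visual features for n automatic object proposals
    W_q, W_k, W_v : projections mapping both modalities to a shared space
    """
    q = decoder_state @ W_q                  # (h,)
    K = proposal_feats @ W_k                 # (n, h)
    V = proposal_feats @ W_v                 # (n, h)
    scores = K @ q / np.sqrt(q.shape[0])     # (n,) relevance of each proposal
    weights = softmax(scores)                # which regions the model localizes
    context = weights @ V                    # (h,) grounded visual context
    return np.concatenate([decoder_state, context]), weights

# toy usage: 36 proposals with 2048-d features, 512-d decoder state (assumed sizes)
rng = np.random.default_rng(0)
state = rng.normal(size=512)
props = rng.normal(size=(36, 2048))
W_q, W_k, W_v = (rng.normal(size=(512, 256)),
                 rng.normal(size=(2048, 256)),
                 rng.normal(size=(2048, 256)))
fused_state, attention = attend_to_proposals(state, props, W_q, W_k, W_v)
```

The attention weights are what would let such a model "localize the correct proposals" when recovering a masked word.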
Related papers
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
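A minimal sketch of the classifier-guided idea described in this entry: re-rank candidate captions from a frozen captioner by combining the language-model score with a text classifier's probability for the target semantics (e.g. audibility). The callables lm_logprob and classifier_prob are assumed stand-ins for real models, and the paper's exact guidance mechanism may differ.

```python
import math

def guided_select(candidates, lm_logprob, classifier_prob, alpha=1.0):
    """Pick the caption that balances fluency (LM score) and the classifier's
    confidence that the caption matches the desired semantics."""
    scored = [(lm_logprob(c) + alpha * math.log(classifier_prob(c) + 1e-9), c)
              for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]
```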
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- Late fusion ensembles for speech recognition on diverse input audio representations [0.0]
We explore diverse representations of speech audio and their effect on the performance of a late-fusion ensemble of E-Branchformer models.
We show that improvements of 1% to 14% can still be achieved over state-of-the-art models trained using comparable techniques.
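A minimal sketch of late fusion for this entry: combine per-step log-probabilities from several models and take the fused decision. It assumes the models produce synchronized (T, vocab) posteriors; practical ASR ensembling often fuses hypotheses instead (e.g. ROVER-style voting), so this is an illustrative simplification.

```python
import numpy as np

def late_fusion(logprob_list, weights=None):
    """Weighted combination of (T, vocab) log-probability matrices produced by
    several ASR models, e.g. trained on different audio feature representations."""
    if weights is None:
        weights = [1.0 / len(logprob_list)] * len(logprob_list)
    fused = sum(w * lp for w, lp in zip(weights, logprob_list))
    return fused.argmax(axis=-1)  # fused per-step token decisions
```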
arXiv Detail & Related papers (2024-12-01T10:19:24Z)
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
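As a generic sketch of hotword biasing (not necessarily how VHASR implements it), image-derived keywords such as detected object labels can be given a score bonus during decoding; the boost value and the word-level vocabulary here are assumptions.

```python
def bias_toward_hotwords(logits, vocab, hotwords, boost=2.0):
    """Add a bonus to vocabulary entries matching hotwords derived from the image,
    nudging decoding toward visually grounded words."""
    hot = {w.lower() for w in hotwords}
    return [score + boost if token.lower() in hot else score
            for score, token in zip(logits, vocab)]
```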
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging a mixture-of-experts approach for audiovisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
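A minimal sketch of the "lightweight projection" step described in this entry: map visual tokens into the speech feature space with a single linear layer. Whether the projected tokens are prepended, interleaved, or consumed via cross-attention is an assumption here.

```python
import numpy as np

def prepend_visual_tokens(visual_tokens, W_proj, speech_feats):
    """Project visual tokens (n, d_v) into the speech feature space via a
    lightweight linear map (d_v, d_s), then prepend them to the speech
    sequence (T, d_s) so a single encoder can attend over both modalities."""
    mapped = visual_tokens @ W_proj
    return np.concatenate([mapped, speech_feats], axis=0)
```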
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
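A minimal sketch of the kNN retrieval component described in this entry: look up the captions whose stored image features are most similar to the query image, so they can condition generation. The cosine-similarity memory and the variable names are assumptions; the paper's retriever details may differ.

```python
import numpy as np

def knn_retrieve(query_feat, memory_feats, memory_captions, k=5):
    """Return the k captions whose stored image features (N, d) are most similar
    (cosine) to the query image feature (d,)."""
    q = query_feat / np.linalg.norm(query_feat)
    M = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    top = np.argsort(-(M @ q))[:k]
    return [memory_captions[i] for i in top]
```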
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on visual speech units, i.e., discretized visual speech representations.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
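A minimal sketch of how continuous visual speech features can be discretized into units, in the spirit of this entry: assign each frame to its nearest k-means centroid and use the resulting IDs as pre-training targets. The exact feature extractor and clustering setup used in the paper are not assumed here.

```python
import numpy as np

def to_visual_speech_units(frame_feats, centroids):
    """Discretize per-frame visual speech features (T, d) into unit IDs by
    nearest k-means centroid (K, d); the IDs can serve as pre-training targets."""
    d2 = ((frame_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    return d2.argmin(axis=1)  # (T,) unit IDs
```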
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Looking Enhances Listening: Recovering Missing Speech Using Images [40.616935661628155]
We present a set of experiments where we show the utility of the visual modality under noisy conditions.
Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations.
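As a simple sketch of the masking setup described in this entry, the acoustic frames aligned to a chosen word can be corrupted before recognition; whether the paper uses zeroed frames, silence, or noise is an assumption here.

```python
import numpy as np

def mask_word_frames(features, start_frame, end_frame, fill=0.0):
    """Corrupt the acoustic frames aligned to one word; a visually grounded ASR
    model should still be able to recover the word from the paired image."""
    corrupted = features.copy()
    corrupted[start_frame:end_frame] = fill
    return corrupted
```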
arXiv Detail & Related papers (2020-02-13T17:12:51Z)