Attention-Based Keyword Localisation in Speech using Visual Grounding
- URL: http://arxiv.org/abs/2106.08859v1
- Date: Wed, 16 Jun 2021 15:29:11 GMT
- Title: Attention-Based Keyword Localisation in Speech using Visual Grounding
- Authors: Kayode Olaleye and Herman Kamper
- Abstract summary: We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
- Score: 32.170748231414365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visually grounded speech models learn from images paired with spoken
captions. By tagging images with soft text labels using a trained visual
classifier with a fixed vocabulary, previous work has shown that it is possible
to train a model that can detect whether a particular text keyword occurs in
speech utterances or not. Here we investigate whether visually grounded speech
models can also do keyword localisation: predicting where, within an utterance,
a given textual keyword occurs without any explicit text-based or alignment
supervision. We specifically consider whether incorporating attention into a
convolutional model is beneficial for localisation. Although absolute
localisation performance with visually supervised models is still modest
(compared to using unordered bag-of-word text labels for supervision), we show
that attention provides a large gain in performance over previous visually
grounded models. As in many other speech-image studies, we find that many of
the incorrect localisations are due to semantic confusions, e.g. locating the
word 'backstroke' for the query keyword 'swimming'.
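As a rough illustration of the setup the abstract describes, the sketch below shows a convolutional speech encoder with per-keyword attention pooling: detection uses the attention-pooled score, and localisation reads off the frame with the highest attention weight. This is a minimal sketch in PyTorch, not the authors' exact architecture; the layer sizes, the `vocab_size` of visual tags, and the attention-pooling design are assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact model): a CNN
# speech encoder with one attention distribution per keyword. Detection comes
# from the attention-pooled features; localisation takes the frame where the
# queried keyword's attention weight peaks.
import torch
import torch.nn as nn

class AttentionKeywordLocaliser(nn.Module):
    def __init__(self, n_mels=40, hidden=256, vocab_size=67):
        super().__init__()
        # Convolutional encoder over the input spectrogram (batch, n_mels, T).
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4), nn.ReLU(),
        )
        # One attention query and one detection weight vector per keyword.
        self.queries = nn.Parameter(torch.randn(vocab_size, hidden))
        self.weights = nn.Parameter(torch.randn(vocab_size, hidden))
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, spectrogram):
        feats = self.encoder(spectrogram).transpose(1, 2)      # (B, T, hidden)
        scores = torch.einsum("bth,vh->bvt", feats, self.queries)
        alpha = torch.softmax(scores, dim=-1)                  # (B, vocab, T)
        pooled = torch.einsum("bvt,bth->bvh", alpha, feats)    # (B, vocab, hidden)
        # Per-keyword detection probability from the pooled features; during
        # training this would be matched to the soft visual tags with a
        # BCE-style loss (no location supervision is used).
        probs = torch.sigmoid((pooled * self.weights).sum(-1) + self.bias)
        # Localisation: frame with the highest attention weight per keyword.
        locations = alpha.argmax(dim=-1)                       # (B, vocab)
        return probs, locations
```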
Related papers
- Pixel Aligned Language Models [94.32841818609914]
We develop a vision-language model that can take locations as either inputs or outputs.
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention.
arXiv Detail & Related papers (2023-12-14T18:57:58Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
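The exponential-moving-average teacher mentioned above is a standard construction; a minimal sketch of the update is shown below (the momentum value and the tiny example modules are illustrative, not SILC's actual configuration).

```python
import copy
import torch

@torch.no_grad()
def update_ema_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                       momentum: float = 0.999) -> None:
    """EMA update: teacher <- momentum * teacher + (1 - momentum) * student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Typical setup: the teacher starts as a frozen copy of the student and only
# changes through the EMA update; local student features are then regressed
# onto the teacher's outputs (the self-distillation signal).
student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
update_ema_teacher(student, teacher)
```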
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on the hypothesis that, in natural scenes, intermediate semantic concepts are correlated with local keypoints.
Based on this hypothesis, image regions can be efficiently represented with a bag-of-visual-words (BOW) model, and a machine learning approach, such as an SVM, can then be used to label image regions with semantic annotations.
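A rough sketch of that kind of pipeline is given below, using scikit-learn: local keypoint descriptors are quantised into a visual-word vocabulary with k-means, each region becomes a histogram over the vocabulary, and an SVM maps histograms to local semantic labels. The descriptor dimensionality, vocabulary size, kernel, and the random stand-in data are placeholders, not the paper's settings.

```python
# Illustrative bag-of-visual-words + SVM pipeline (not the paper's exact setup).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(all_descriptors: np.ndarray, n_words: int = 200) -> KMeans:
    """Quantise local keypoint descriptors into a visual-word vocabulary."""
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bow_histogram(region_descriptors: np.ndarray, vocab: KMeans) -> np.ndarray:
    """Represent one image region as a normalised histogram of visual words."""
    words = vocab.predict(region_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Train: descriptors pooled over many regions build the vocabulary; an SVM then
# maps each region's histogram to a local semantic concept (e.g. 'sky', 'rock').
descriptors = np.random.rand(2000, 64)           # stand-in for SIFT-like features
vocab = build_vocabulary(descriptors)
X = np.stack([bow_histogram(descriptors[i::50], vocab) for i in range(50)])
y = np.random.randint(0, 5, size=50)             # stand-in region labels
classifier = SVC(kernel="linear").fit(X, y)
```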
arXiv Detail & Related papers (2022-10-17T12:57:51Z)
- Towards visually prompted keyword localisation for zero-resource spoken languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL).
In VPKL, given an image depicting a keyword, the goal is to detect whether that keyword occurs in an utterance and to predict where it occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
arXiv Detail & Related papers (2022-10-12T14:17:34Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Keyword localisation in untranscribed speech using visually grounded speech models [21.51901080054713]
Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs.
Visually grounded speech (VGS) models are trained on unlabelled images paired with spoken captions.
Masked-based localisation gives some of the best reported localisation scores from a VGS model.
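The summary does not spell out how masked-based localisation works; the usual recipe is to compare the detector's keyword score on the full utterance against its score when a segment of the input is masked out, and to attribute the keyword to the segment whose removal hurts the score most. The sketch below follows that reading; `score_keyword` is a hypothetical stand-in for a trained VGS keyword detector.

```python
import numpy as np

def localise_by_masking(frames: np.ndarray, score_keyword, mask_width: int = 10):
    """Slide a mask over the utterance; attribute the keyword to the segment
    whose removal causes the largest drop in the detection score.

    `score_keyword(frames)` is a placeholder for a trained visually grounded
    keyword detector returning a scalar score for the queried keyword.
    """
    full_score = score_keyword(frames)
    best_start, best_drop = 0, -np.inf
    for start in range(0, max(1, len(frames) - mask_width + 1)):
        masked = frames.copy()
        masked[start:start + mask_width] = 0.0   # zero out one segment of frames
        drop = full_score - score_keyword(masked)
        if drop > best_drop:
            best_start, best_drop = start, drop
    return best_start, best_start + mask_width   # predicted frame span
```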
arXiv Detail & Related papers (2022-02-02T16:14:29Z)
- Towards localisation of keywords in speech using weak supervision [30.67230721247154]
Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available.
We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly.
arXiv Detail & Related papers (2020-12-14T10:30:51Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)