Keyword localisation in untranscribed speech using visually grounded
speech models
- URL: http://arxiv.org/abs/2202.01107v1
- Date: Wed, 2 Feb 2022 16:14:29 GMT
- Title: Keyword localisation in untranscribed speech using visually grounded
speech models
- Authors: Kayode Olaleye, Dan Oneata and Herman Kamper
- Abstract summary: Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs.
VGS models are trained on unlabelled images paired with spoken captions.
Masking-based localisation gives some of the best reported localisation scores from a VGS model.
- Score: 21.51901080054713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword localisation is the task of finding where in a speech utterance a
given query keyword occurs. We investigate to what extent keyword localisation
is possible using a visually grounded speech (VGS) model. VGS models are
trained on unlabelled images paired with spoken captions. These models are
therefore self-supervised -- trained without any explicit textual label or
location information. To obtain training targets, we first tag training images
with soft text labels using a pretrained visual classifier with a fixed
vocabulary. This enables a VGS model to predict the presence of a written
keyword in an utterance, but not its location. We consider four ways to equip
VGS models with localisation capabilities. Two of these -- a saliency approach
and input masking -- can be applied to an arbitrary prediction model after
training, while the other two -- attention and a score aggregation approach --
are incorporated directly into the structure of the model. Masking-based
localisation gives some of the best reported localisation scores from a VGS
model, with an accuracy of 57% when the system knows that a keyword occurs in
an utterance and needs to predict its location. In a setting where localisation
is performed after detection, an $F_1$ of 25% is achieved, and in a setting
where a keyword spotting ranking pass is first performed, we get a localisation
P@10 of 32%. While these scores are modest compared to the idealised setting
with unordered bag-of-words supervision (from transcriptions), these models do
not receive any textual or location supervision. Further analyses show that
these models are limited by the first detection or ranking pass. Moreover,
individual keyword localisation performance is correlated with the tagging
performance from the visual classifier. We also show qualitatively how and
where semantic mistakes occur, e.g. that the model locates 'surfer' when queried
with 'ocean'.
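To make the input-masking idea concrete, here is a minimal sketch (not the authors' exact implementation) of how a trained VGS keyword detector could be turned into a localiser: slide a mask over the acoustic features, re-score the query keyword, and take the segment whose removal causes the largest drop in the detection score as the predicted location. The function `score_fn`, the feature layout, and the window/stride values are illustrative assumptions.

```python
import numpy as np

def localise_by_masking(features, keyword_idx, score_fn,
                        frame_rate=100, window=50, stride=10):
    """Slide a mask over the utterance and re-score the keyword.

    features : [T, D] array of acoustic features for one utterance.
    score_fn : callable returning per-keyword detection probabilities for a
               whole utterance (stand-in for the trained VGS detector, whose
               training targets are soft labels from a visual tagger).
    Returns the (start, end) of the window, in seconds, whose masking
    causes the largest drop in the keyword's detection score.
    """
    base_score = score_fn(features)[keyword_idx]
    num_frames = features.shape[0]
    best_drop, best_start = -np.inf, 0
    for start in range(0, max(num_frames - window, 1), stride):
        masked = features.copy()
        masked[start:start + window] = 0.0  # "remove" this segment
        drop = base_score - score_fn(masked)[keyword_idx]
        if drop > best_drop:
            best_drop, best_start = drop, start
    return best_start / frame_rate, (best_start + window) / frame_rate
```

The same frozen detector is used throughout and only the input changes, which is why, as noted in the abstract, this approach can be applied to an arbitrary prediction model after training.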
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Visually Grounded Keyword Detection and Localisation for Low-Resource
Languages [0.0]
The study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech.
Four methods for localisation are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%.
A new dataset containing spoken captions in the Yoruba language is also collected and released for cross-lingual keyword localisation.
arXiv Detail & Related papers (2023-02-01T21:32:15Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Towards visually prompted keyword localisation for zero-resource spoken
languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL).
Given an image of a keyword, the task is to detect whether and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
arXiv Detail & Related papers (2022-10-12T14:17:34Z) - Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
arXiv Detail & Related papers (2021-06-16T15:29:11Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - Towards localisation of keywords in speech using weak supervision [30.67230721247154]
Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available.
We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly.
arXiv Detail & Related papers (2020-12-14T10:30:51Z) - Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)