Towards visually prompted keyword localisation for zero-resource spoken
languages
- URL: http://arxiv.org/abs/2210.06229v1
- Date: Wed, 12 Oct 2022 14:17:34 GMT
- Title: Towards visually prompted keyword localisation for zero-resource spoken
languages
- Authors: Leanne Nortje and Herman Kamper
- Abstract summary: We formalise the task of visually prompted keyword localisation (VPKL):
given an image of a keyword, detect and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
- Score: 27.696096343873215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imagine being able to show a system a visual depiction of a keyword and
finding spoken utterances that contain this keyword from a zero-resource speech
corpus. We formalise this task and call it visually prompted keyword
localisation (VPKL): given an image of a keyword, detect and predict where in
an utterance the keyword occurs. To do VPKL, we propose a speech-vision model
with a novel localising attention mechanism which we train with a new keyword
sampling scheme. We show that these innovations give improvements in VPKL over
an existing speech-vision model. We also compare to a visual bag-of-words (BoW)
model where images are automatically tagged with visual labels and paired with
unlabelled speech. Although this visual BoW can be queried directly with a
written keyword (while ours takes image queries), our new model still
outperforms the visual BoW in both detection and localisation, giving a 16%
relative improvement in localisation F1.
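The abstract does not spell out the localising attention mechanism, so the sketch below only illustrates the task's general shape under common assumptions: embed the image query and the speech frames in a shared space, score every frame against the query, pool the frame scores for detection, and take the best-scoring frame for localisation. The names, dimensions and pooling choice are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vpkl_scores(image_query_emb, speech_frame_embs):
    """Score an utterance against an image query for VPKL.

    image_query_emb:    (d,) embedding of the query image (keyword depiction)
    speech_frame_embs:  (T, d) per-frame embeddings of one utterance
    Returns a pooled detection score and per-frame localisation scores.
    """
    # Frame-wise similarity between the visual query and each speech frame
    frame_scores = speech_frame_embs @ image_query_emb          # (T,)
    # Attention weights over frames: the attention-weighted score gives an
    # utterance-level detection score, the argmax a hypothesised location
    attn = F.softmax(frame_scores, dim=0)                        # (T,)
    detection_score = (attn * frame_scores).sum()                # scalar
    predicted_frame = int(frame_scores.argmax())
    return detection_score, frame_scores, predicted_frame

# Example with random embeddings standing in for real encoder outputs
d, T = 512, 100
detection, per_frame, loc = vpkl_scores(torch.randn(d), torch.randn(T, d))
```

Detection would then threshold detection_score, and localisation would be judged by whether predicted_frame falls inside the ground-truth word segment, which is roughly what the reported localisation F1 measures.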
Related papers
- Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
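The summary only names the mechanism, so as a rough, hedged illustration: value-value attention (and its self-self generalisation) replaces the usual query-key product with a product of one projection with itself, so each token mainly attends to tokens similar to itself, which preserves spatial locality. The sketch below is not the GEM implementation; the projection and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def self_self_attention(x, proj, temperature=1.0):
    """Self-self attention in the spirit of value-value attention:
    the same projection is used on both sides of the attention product.

    x:    (N, d) token features from a ViT layer
    proj: a single linear projection (e.g. the value projection)
    """
    p = proj(x)                                                     # (N, d)
    attn = F.softmax(p @ p.t() / (temperature * p.shape[-1] ** 0.5), dim=-1)
    return attn @ p

# Illustrative usage with a random projection standing in for a pretrained one
N, d = 196, 768
localized = self_self_attention(torch.randn(N, d), torch.nn.Linear(d, d))
```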
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Natural Scene Image Annotation Using Local Semantic Concepts and Spatial
Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on the hypothesis that, in natural scenes, intermediate semantic concepts are correlated with local keypoints.
Based on this hypothesis, image regions can be efficiently represented with a BoW model, and a machine learning approach such as an SVM can then label those regions with semantic annotations.
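The BoW-plus-SVM pipeline described here is a standard one, so a minimal scikit-learn sketch may help make it concrete; the random arrays stand in for real local keypoint descriptors (e.g. SIFT) and the three region labels are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_histogram(descriptors, codebook):
    """Quantise a region's local descriptors against a visual-word codebook
    and return a normalised bag-of-visual-words histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy data standing in for keypoint descriptors from labelled image regions
rng = np.random.default_rng(0)
train_regions = [rng.normal(size=(50, 128)) for _ in range(20)]
train_labels = rng.integers(0, 3, size=20)          # e.g. sky / grass / water

codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
codebook.fit(np.vstack(train_regions))              # learn the visual vocabulary

X = np.array([bow_histogram(r, codebook) for r in train_regions])
clf = SVC(kernel="rbf").fit(X, train_labels)        # label regions from BoW histograms
print(clf.predict(X[:3]))
```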
arXiv Detail & Related papers (2022-10-17T12:57:51Z)
- Keyword localisation in untranscribed speech using visually grounded
speech models [21.51901080054713]
Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs.
Visually grounded speech (VGS) models are trained on unlabelled images paired with spoken captions.
Masked-based localisation gives some of the best reported localisation scores from a VGS model.
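The summary does not describe the masking scheme, so the following is only a generic sketch of masking-style localisation, assuming a trained detector that maps an utterance to a keyword score: hide successive windows of frames and predict the window whose removal hurts the score most. The toy scorer below stands in for a real VGS model.

```python
import numpy as np

def masked_based_localisation(score_fn, frames, win=20, stride=10):
    """Localise a detected keyword by masking: zero out successive windows of
    frames and return the window whose removal causes the largest score drop.

    score_fn: maps a (T, d) frame sequence to a keyword detection score
    frames:   (T, d) acoustic frames of one utterance
    """
    full_score = score_fn(frames)
    best_drop, best_span = -np.inf, (0, win)
    for start in range(0, max(len(frames) - win, 1), stride):
        masked = frames.copy()
        masked[start:start + win] = 0.0          # mask out this window
        drop = full_score - score_fn(masked)
        if drop > best_drop:
            best_drop, best_span = drop, (start, start + win)
    return best_span, best_drop

# Toy scorer: pretends the keyword evidence lives around frames 40-60
toy = lambda f: float(f[40:60].sum())
span, drop = masked_based_localisation(toy, np.random.rand(100, 1))
```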
arXiv Detail & Related papers (2022-02-02T16:14:29Z)
- Visual Information Guided Zero-Shot Paraphrase Generation [71.33405403748237]
We propose visual information guided zero-shot paraphrase generation (ViPG) based only on paired image-caption data.
It jointly trains an image captioning model and a paraphrasing model, and leverages the image captioning model to guide the training of the paraphrasing model.
Both automatic and human evaluation show that our model can generate paraphrases with good relevance, fluency and diversity.
arXiv Detail & Related papers (2022-01-22T18:10:39Z)
- Visual Keyword Spotting with Attention [82.79015266453533]
We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
arXiv Detail & Related papers (2021-10-29T17:59:04Z)
- Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
arXiv Detail & Related papers (2021-06-16T15:29:11Z)
- Towards localisation of keywords in speech using weak supervision [30.67230721247154]
Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available.
We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly.
arXiv Detail & Related papers (2020-12-14T10:30:51Z)
- Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
arXiv Detail & Related papers (2020-10-05T23:06:24Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
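As a hedged illustration of the similarity-map idea (not the KWS-Net architecture itself): build a similarity map between the keyword's phonetic encoding and the video encoding, then let a small CNN decide whether the map contains a match pattern. The dimensions and layer sizes below are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityMapSpotter(nn.Module):
    """Toy similarity-map keyword spotter: a cosine similarity map between
    keyword and video encodings, followed by a small CNN pattern detector."""

    def __init__(self):
        super().__init__()
        self.detector = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, keyword_emb, video_emb):
        # keyword_emb: (K, d) phonetic encoding, video_emb: (T, d) visual encoding
        k = F.normalize(keyword_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        sim_map = k @ v.t()                          # (K, T) similarity map
        logit = self.detector(sim_map[None, None])   # add batch & channel dims
        return torch.sigmoid(logit), sim_map

spotter = SimilarityMapSpotter()
prob, sim = spotter(torch.randn(8, 256), torch.randn(75, 256))
```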
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabelled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.