Towards localisation of keywords in speech using weak supervision
- URL: http://arxiv.org/abs/2012.07396v1
- Date: Mon, 14 Dec 2020 10:30:51 GMT
- Title: Towards localisation of keywords in speech using weak supervision
- Authors: Kayode Olaleye, Benjamin van Niekerk, Herman Kamper
- Abstract summary: Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available.
We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly.
- Score: 30.67230721247154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developments in weakly supervised and self-supervised models could enable
speech technology in low-resource settings where full transcriptions are not
available. We consider whether keyword localisation is possible using two forms
of weak supervision where location information is not provided explicitly. In
the first, only the presence or absence of a word is indicated, i.e. a
bag-of-words (BoW) labelling. In the second, visual context is provided in the
form of an image paired with an unlabelled utterance; a model then needs to be
trained in a self-supervised fashion using the paired data. For keyword
localisation, we adapt a saliency-based method typically used in the vision
domain. We compare this to an existing technique that performs localisation as
a part of the network architecture. While the saliency-based method is more
flexible (it can be applied without architectural restrictions), we identify a
critical limitation when using it for keyword localisation. Of the two forms of
supervision, the visually trained model performs worse than the BoW-trained
model. We show qualitatively that the visually trained model sometimes locate
semantically related words, but this is not consistent. While our results show
that there is some signal allowing for localisation, it also calls for other
localisation methods better matched to these forms of weak supervision.
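To make the first form of supervision and the saliency method concrete, here is a minimal PyTorch sketch of (1) a keyword detector trained only on bag-of-words labels and (2) gradient-based saliency used post hoc for localisation. This is an illustration under assumed names and hyperparameters (`KeywordDetector`, `VOCAB_SIZE`, the convolutional stack), not the paper's actual architecture.

```python
# Illustrative sketch only: BoW-supervised keyword detection plus
# gradient-saliency localisation. Architecture and sizes are assumptions,
# not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1000  # keywords in the BoW vocabulary (assumed)
N_MELS = 40        # filterbank coefficients per frame (assumed)

class KeywordDetector(nn.Module):
    """Maps frame features to utterance-level keyword presence scores."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(N_MELS, 128, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.out = nn.Linear(128, VOCAB_SIZE)

    def forward(self, feats):           # feats: (batch, N_MELS, n_frames)
        h = self.conv(feats)            # (batch, 128, n_frames)
        pooled = h.max(dim=2).values    # pool over time: location is never supervised
        return self.out(pooled)         # (batch, VOCAB_SIZE) presence logits

def bow_loss(logits, bow_targets):
    """BoW supervision: multi-label BCE on keyword presence/absence only."""
    return F.binary_cross_entropy_with_logits(logits, bow_targets)

def saliency_localisation(model, feats, keyword_id):
    """Saliency adapted from vision: find the frames whose input values
    most influence the detector's score for one keyword."""
    feats = feats.clone().requires_grad_(True)
    model(feats)[0, keyword_id].backward()
    per_frame = feats.grad.abs().sum(dim=1).squeeze(0)  # (n_frames,)
    return int(per_frame.argmax())      # frame index with the highest saliency
```

Note that max-pooling over time means the detector only ever receives an utterance-level signal, which is exactly what makes localisation hard; the saliency step then asks, after the fact, which frames drove the decision.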
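For the second form of supervision, one plausible reading (following earlier visually grounded keyword work; the exact objective is an assumption here, not quoted from the paper) is to let an off-the-shelf image tagger convert the paired image into soft keyword targets, so the speech model is trained without ever seeing a transcription:

```python
# Sketch of visual supervision: soft pseudo-labels from a paired image stand
# in for the missing BoW labels. The tagger, its output space, and the loss
# are assumptions for illustration.
import torch
import torch.nn.functional as F

def visual_targets(image, image_tagger):
    """Soft keyword probabilities for the image (hypothetical tagger that
    scores the same VOCAB_SIZE keywords as the speech model)."""
    with torch.no_grad():
        return torch.sigmoid(image_tagger(image))   # (batch, VOCAB_SIZE)

def visual_loss(speech_logits, image, image_tagger):
    """Train the speech detector to match the tagger's soft scores,
    replacing bow_loss above when only image-utterance pairs exist."""
    return F.binary_cross_entropy_with_logits(
        speech_logits, visual_targets(image, image_tagger)
    )
```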
Related papers
- CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding [91.97362831507434]
Unsupervised visual grounding has been developed to locate regions using pseudo-labels.
We propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels.
Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-05-15T14:42:02Z)
- Towards visually prompted keyword localisation for zero-resource spoken languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL):
given an image of a keyword, detect and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
arXiv Detail & Related papers (2022-10-12T14:17:34Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Adapting CLIP For Phrase Localization Without Further Training [30.467802103692378]
We propose to leverage contrastive language-vision models, CLIP, pre-trained on image and caption pairs.
We adapt CLIP to generate high-resolution spatial feature maps.
Our method for phrase localization requires no human annotations or additional training.
arXiv Detail & Related papers (2022-04-07T17:59:38Z)
- Keyword localisation in untranscribed speech using visually grounded speech models [21.51901080054713]
Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs.
Visually grounded speech (VGS) models are trained on unlabelled images paired with spoken captions.
Masked-based localisation gives some of the best reported localisation scores from a VGS model.
arXiv Detail & Related papers (2022-02-02T16:14:29Z)
- Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
arXiv Detail & Related papers (2021-06-16T15:29:11Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá [23.68953940000046]
Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way.
We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario.
arXiv Detail & Related papers (2020-03-18T17:48:35Z)