Word Discovery in Visually Grounded, Self-Supervised Speech Models
- URL: http://arxiv.org/abs/2203.15081v5
- Date: Tue, 20 Jun 2023 01:55:28 GMT
- Title: Word Discovery in Visually Grounded, Self-Supervised Speech Models
- Authors: Puyuan Peng and David Harwath
- Abstract summary: We show that powerful word segmentation and clustering capability emerges within the model's self-attention heads.
Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models.
- Score: 13.956691231452336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for visually-grounded spoken term discovery. After
training either a HuBERT or wav2vec2.0 model to associate spoken captions with
natural images, we show that powerful word segmentation and clustering
capability emerges within the model's self-attention heads. Our experiments
reveal that this ability is not present to nearly the same extent in the base
HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a
crucial component of the word discovery capability we observe. We also evaluate
our method on the Buckeye word segmentation and ZeroSpeech spoken term
discovery tasks, where we perform on par with or better than currently
published methods on several metrics. Code and model weights are available at
https://github.com/jasonppy/word-discovery.
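The abstract does not spell out how word segments are read out of the self-attention heads, but the general recipe it points to (a per-frame attention curve turned into contiguous word segments) can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration: it assumes a 1-D attention curve over audio frames (for example, one head's attention from a CLS-style token) is already available, and the threshold rule, the `frame_hop` value, and the toy input are illustrative assumptions rather than the authors' exact procedure.

```python
import torch

def attention_to_segments(attn, threshold=0.5, frame_hop=0.02):
    """Turn a per-frame attention curve into candidate word segments.

    attn:      1-D tensor of attention weights over audio frames
               (e.g. one head's attention from a CLS-style token).
    threshold: fraction of the max attention used to binarize the curve
               (an illustrative choice, not the paper's exact recipe).
    frame_hop: seconds per frame (20 ms is typical for HuBERT/wav2vec2.0).
    Returns a list of (start_sec, end_sec) tuples.
    """
    active = attn >= threshold * attn.max()
    segments, start = [], None
    for i, flag in enumerate(active.tolist()):
        if flag and start is None:
            start = i                      # a high-attention region opens
        elif not flag and start is not None:
            segments.append((start * frame_hop, i * frame_hop))
            start = None                   # the region closes
    if start is not None:                  # handle a trailing open region
        segments.append((start * frame_hop, len(active) * frame_hop))
    return segments

if __name__ == "__main__":
    # Stand-in for a real attention curve: two bumps ~ two hypothesized words.
    t = torch.linspace(0, 1, 150)
    attn = torch.exp(-((t - 0.3) ** 2) / 0.002) + torch.exp(-((t - 0.7) ** 2) / 0.002)
    print(attention_to_segments(attn))
```

In practice the attention curve would come from a visually grounded HuBERT or wav2vec2.0 checkpoint (see the repository linked above), and the resulting segment-level features would then be clustered into word categories; only the boundary-extraction step is sketched here.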
Related papers
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models [6.47452771256903]
We introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning.
We propose object-oriented BERT (OctoBERT), a novel visually-grounded language model by pre-training on image-text pairs highlighting grounding as an objective.
We demonstrate that OctoBERT is a more coherent and fast grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly.
arXiv Detail & Related papers (2023-06-14T18:10:05Z)
- Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model [21.286529902957724]
We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
arXiv Detail & Related papers (2023-05-19T05:19:04Z)
- Towards visually prompted keyword localisation for zero-resource spoken languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL): given an image depicting a keyword, detect whether and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model. (A minimal illustrative sketch of this task setup follows below.)
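As a rough, hypothetical illustration of the VPKL setup described above (not the model proposed in that paper), the sketch below scores an image-derived keyword query against per-frame speech embeddings and returns a detection score plus a crude localisation window. The encoders are stubbed out with random tensors; the embedding size, the use of cosine similarity, and the fixed window width are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def locate_keyword(query_emb, frame_embs, frame_hop=0.02):
    """Score an image-derived keyword query against utterance frames.

    query_emb:  (D,) embedding of the visually prompted keyword.
    frame_embs: (T, D) per-frame speech embeddings.
    Returns (detection_score, (start_sec, end_sec)) around the best-matching
    frame, using cosine similarity and a fixed-width window as a stand-in
    for a learned localiser.
    """
    sims = F.cosine_similarity(frame_embs, query_emb.unsqueeze(0), dim=-1)  # (T,)
    best = int(sims.argmax())
    detection_score = float(sims.max())      # utterance-level detection score
    start = max(best - 5, 0) * frame_hop     # crude +/- 5-frame window
    end = (best + 5) * frame_hop
    return detection_score, (start, end)

if __name__ == "__main__":
    torch.manual_seed(0)
    query = torch.randn(256)        # stand-in for an image/keyword encoder output
    frames = torch.randn(100, 256)  # stand-in for a speech encoder output
    print(locate_keyword(query, frames))
```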
arXiv Detail & Related papers (2022-10-12T14:17:34Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Visual Keyword Spotting with Attention [82.79015266453533]
We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
arXiv Detail & Related papers (2021-10-29T17:59:04Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- A Visuospatial Dataset for Naturalistic Verb Learning [18.654373173232205]
We introduce a new dataset for training and evaluating grounded language models.
Our data is collected within a virtual reality environment and is designed to emulate the quality of language data to which a pre-verbal child is likely to have access.
We use the collected data to compare several distributional semantics models for verb learning.
arXiv Detail & Related papers (2020-10-28T20:47:13Z)
- Neural Twins Talk [0.0]
We introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model.
Visual grounding ensures the existence of words in the caption sentence that are grounded into a particular region in the input image.
We report the results of our experiments in three image captioning tasks on the COCO dataset.
arXiv Detail & Related papers (2020-09-26T06:58:58Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.