Visually Grounded Speech Models have a Mutual Exclusivity Bias
- URL: http://arxiv.org/abs/2403.13922v1
- Date: Wed, 20 Mar 2024 18:49:59 GMT
- Title: Visually Grounded Speech Models have a Mutual Exclusivity Bias
- Authors: Leanne Nortje, Dan Oneaţă, Yevgen Matusevych, Herman Kamper
- Abstract summary: When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias.
This bias has been studied computationally, but only in models that use discrete word representations as input.
We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio.
- Score: 20.495178526318185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialisation strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialisation approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered.
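As a rough illustration of the test described in the abstract, the sketch below scores a spoken novel word against a novel and a familiar image using hypothetical `embed_audio` and `embed_image` encoders and reports how often the novel object wins; the paper's actual model, data and scoring may differ.

```python
# Minimal sketch of the mutual-exclusivity (ME) test described above.
# `embed_audio` and `embed_image` are hypothetical encoders standing in for
# trained audio and vision branches; the real setup may differ.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def me_trial(novel_audio, novel_image, familiar_image, embed_audio, embed_image):
    # query with the spoken novel word and compare it to both candidate objects
    query = embed_audio(novel_audio)
    score_novel = cosine(query, embed_image(novel_image))
    score_familiar = cosine(query, embed_image(familiar_image))
    return score_novel > score_familiar   # ME-consistent choice

def me_bias(trials, embed_audio, embed_image):
    # fraction of trials on which the novel object is chosen; 0.5 means no bias
    choices = [me_trial(a, n, f, embed_audio, embed_image) for a, n, f in trials]
    return sum(choices) / len(choices)
```

Under this framing, a bias score above 0.5 indicates a preference for mapping novel words to novel objects.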
Related papers
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbation methods such as typos and word-order shuffling, which resonate with human cognitive patterns and allow the perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
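For illustration only, here is a hedged sketch of perturbations of the kind named above (typos and word-order shuffling); the function names, probabilities and window size are invented, and the paper's actual perturbation pipeline may differ.

```python
# Illustrative text perturbations: character swaps (typos) and local
# word-order shuffling. Not the paper's exact pipeline.
import random

def add_typo(word, rng):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    # swap two adjacent characters to simulate a typo
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb(sentence, typo_prob=0.1, shuffle_window=3, seed=0):
    rng = random.Random(seed)
    words = [add_typo(w, rng) if rng.random() < typo_prob else w
             for w in sentence.split()]
    out = []
    for start in range(0, len(words), shuffle_window):
        chunk = words[start:start + shuffle_window]
        # shuffle word order only within a small local window
        rng.shuffle(chunk)
        out.extend(chunk)
    return " ".join(out)

print(perturb("the quick brown fox jumps over the lazy dog"))
```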
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Are words equally surprising in audio and audio-visual comprehension? [13.914373331208774]
We compare the ERP signature (N400) associated with each word in audio-only and audio-visual presentations of the same verbal stimuli.
Our results indicate that cognitive effort differs significantly between multimodal and unimodal settings.
This highlights the significant impact of local lexical context on cognitive processing in a multimodal environment.
arXiv Detail & Related papers (2023-07-14T11:17:37Z)
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
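A hedged sketch of the pair-mining idea is given below: the few given word-image examples are used to rank unlabelled audio and images by similarity, and the top-ranked items are paired as new (noisy) training examples. The encoders `embed_audio` and `embed_image` and the ranking scheme are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of mining new word-image training pairs from unlabelled
# collections, given a handful of support pairs per word.
import numpy as np

def mine_pairs(support_pairs, unlabelled_audio, unlabelled_images,
               embed_audio, embed_image, top_k=10):
    new_pairs = []
    for word_audio, word_image in support_pairs:
        w_a = embed_audio(word_audio)
        w_i = embed_image(word_image)
        # rank unlabelled audio by similarity to the spoken example
        audio_scores = [(float(np.dot(w_a, embed_audio(a))), a) for a in unlabelled_audio]
        # rank unlabelled images by similarity to the image example
        image_scores = [(float(np.dot(w_i, embed_image(im))), im) for im in unlabelled_images]
        top_audio = [a for _, a in sorted(audio_scores, key=lambda t: t[0], reverse=True)[:top_k]]
        top_images = [im for _, im in sorted(image_scores, key=lambda t: t[0], reverse=True)[:top_k]]
        # pair the top-ranked audio and images as new (noisy) training examples
        new_pairs.extend(zip(top_audio, top_images))
    return new_pairs
```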
arXiv Detail & Related papers (2023-06-20T08:27:42Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Modelling word learning and recognition using visually grounded speech [18.136170489933082]
Computational models of speech recognition often assume that the set of target words is already given.
This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision.
Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input.
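One common way to exploit such statistical dependencies is a contrastive objective over co-occurring audio-image pairs; the PyTorch sketch below shows an InfoNCE-style version as an illustrative assumption, not the specific loss used in this line of work.

```python
# Illustrative contrastive (InfoNCE-style) loss over a batch of matched
# audio-image embeddings; matched pairs lie on the diagonal of the
# similarity matrix. Not necessarily the loss used in the papers above.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """audio_emb, image_emb: (batch, dim) embeddings of matched pairs."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature          # pairwise similarities
    targets = torch.arange(a.size(0))         # correct match for each row
    # symmetric cross-entropy: audio-to-image and image-to-audio directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# usage with random stand-in embeddings
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```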
arXiv Detail & Related papers (2022-03-14T08:59:37Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
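As a minimal illustration of a generative LM over sub-word linguistic units, the PyTorch sketch below trains an LSTM to predict the next phoneme or syllable token; the layer sizes and unit inventory are invented and the paper's actual architecture and training setup may differ.

```python
# Minimal LSTM language model over sub-word units (phoneme/syllable ids).
# Sizes and vocabulary are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class SubwordLSTMLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        """tokens: (batch, time) unit ids -> next-unit logits."""
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)

# next-unit prediction loss on a toy batch
model = SubwordLSTMLM(vocab_size=60)
batch = torch.randint(0, 60, (4, 20))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 60), batch[:, 1:].reshape(-1))
```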
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Investigating Novel Verb Learning in BERT: Selectional Preference Classes and Alternation-Based Syntactic Generalization [22.112988757841467]
We deploy a novel word-learning paradigm to test BERT's few-shot learning capabilities for two aspects of English verbs.
We find that BERT makes robust grammatical generalizations after just one or two instances of a novel word in fine-tuning.
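A hedged sketch of the general novel-word paradigm is shown below: a made-up verb is added to BERT's vocabulary and the model is briefly fine-tuned on a couple of sentences containing it. The token "blick", the sentences and the hyperparameters are illustrative inventions, not the paper's exact setup.

```python
# Hedged sketch of few-shot novel-word fine-tuning with BERT.
# The novel verb "blick", the sentences and hyperparameters are invented.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# register the novel verb and grow the embedding matrix accordingly
tokenizer.add_tokens(["blick"])
model.resize_token_embeddings(len(tokenizer))

sentences = ["we blick the soup to the guest.", "they blick him a letter."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100   # ignore padding in the loss

# a few fine-tuning steps on the novel-word sentences (simplified: no random masking)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```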
arXiv Detail & Related papers (2020-11-04T17:17:49Z)
- Visually Grounded Compound PCFGs [65.04669567781634]
Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual captions.
arXiv Detail & Related papers (2020-09-25T19:07:00Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
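As a very rough illustration, the sketch below applies attention first over camera views and then over time steps to pool multi-view features; it is only meant to convey the general idea of view-temporal weighting and does not reproduce the paper's actual mechanism.

```python
# Very rough sketch of attention over views and time steps for pooling
# multi-view visual speech features; not the paper's mechanism.
import torch
import torch.nn as nn

class ViewTemporalAttention(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.view_score = nn.Linear(feat_dim, 1)   # scores each view at each frame
        self.time_score = nn.Linear(feat_dim, 1)   # scores each pooled frame

    def forward(self, x):
        """x: (batch, views, time, feat_dim) per-view, per-frame features."""
        view_w = torch.softmax(self.view_score(x), dim=1)              # weight views
        pooled_views = (view_w * x).sum(dim=1)                          # (batch, time, feat)
        time_w = torch.softmax(self.time_score(pooled_views), dim=1)    # weight frames
        return (time_w * pooled_views).sum(dim=1)                       # (batch, feat)

# usage: 2 views, 30 frames, 256-dim features
feats = torch.randn(4, 2, 30, 256)
pooled = ViewTemporalAttention(256)(feats)
```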
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.