The mutual exclusivity bias of bilingual visually grounded speech models
- URL: http://arxiv.org/abs/2506.04037v1
- Date: Wed, 04 Jun 2025 14:59:22 GMT
- Title: The mutual exclusivity bias of bilingual visually grounded speech models
- Authors: Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper
- Abstract summary: Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. We explore this pattern using bilingual VGS models trained on combinations of English, French, and Dutch.
- Score: 22.97008687596735
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs
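As a rough illustration of the evaluation described in the abstract, the sketch below runs a toy mutual-exclusivity test on precomputed embeddings and compares the variance of familiar versus novel visual embeddings. The embeddings are random placeholders; the actual protocol in the linked me-vgs repository may differ.

```python
# Minimal sketch of a mutual-exclusivity (ME) test on precomputed embeddings.
# Assumes a VGS model has already mapped spoken words and images into a shared
# space; the exact protocol in the me-vgs repository may differ.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def me_trial(novel_audio, novel_image, familiar_image):
    """Return True if the novel spoken word is matched to the novel image."""
    return cosine(novel_audio, novel_image) > cosine(novel_audio, familiar_image)

# Toy stand-ins for model outputs (embedding dimension 512).
novel_audio_embs    = rng.normal(size=(100, 512))
novel_image_embs    = rng.normal(size=(100, 512))
familiar_image_embs = rng.normal(size=(100, 512))

me_accuracy = np.mean([
    me_trial(a, n, f)
    for a, n, f in zip(novel_audio_embs, novel_image_embs, familiar_image_embs)
])
print(f"ME accuracy: {me_accuracy:.2f}  (0.5 = chance, 1.0 = consistent ME choice)")

# Variance analysis from the abstract: compare the spread of visual embeddings
# for familiar vs. novel concepts; a smaller familiar-set variance is argued to
# increase novel/familiar confusion in bilingual models.
print("familiar variance:", familiar_image_embs.var(axis=0).mean())
print("novel variance:   ", novel_image_embs.var(axis=0).mean())
```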
Related papers
- Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling [4.340338299803563]
We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images.
We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba.
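A minimal sketch of the visually prompted keyword localisation idea, under the assumption that a VGS model exposes per-frame speech embeddings and an image embedding in a shared space (the arrays below are random placeholders):

```python
# Hedged sketch of visually prompted keyword localisation: score each speech
# frame against an image-query embedding and take the best-matching frame.
import numpy as np

rng = np.random.default_rng(1)

speech_frames = rng.normal(size=(250, 256))   # (num_frames, dim) per-frame speech embeddings
image_query   = rng.normal(size=(256,))       # embedding of the prompting image

# Cosine similarity of every frame to the image query.
frames_n = speech_frames / np.linalg.norm(speech_frames, axis=1, keepdims=True)
query_n  = image_query / np.linalg.norm(image_query)
scores   = frames_n @ query_n

detection_score = scores.max()          # is the keyword present at all?
predicted_frame = int(scores.argmax())  # where is it?
frame_rate_hz = 50                      # assumed frame rate; depends on the encoder
print(f"score={detection_score:.3f}, location approx. {predicted_frame / frame_rate_hz:.2f}s")
```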
arXiv Detail & Related papers (2024-09-03T17:59:50Z)
- Why do LLaVA Vision-Language Models Reply to Images in English? [15.727116803057633]
We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs).
Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query.
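One way the reported effect could be probed is sketched below: ask the same non-English questions with and without an image and compare how often the reply comes back in English. The `generate` function is a placeholder standing in for a LLaVA-style model call; `langdetect` is used only as a convenient off-the-shelf language identifier.

```python
# Sketch of measuring the English-response bias of a vision-language model.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic

def english_rate(replies):
    return sum(detect(r) == "en" for r in replies) / len(replies)

def run_probe(generate, prompts, images):
    text_only  = [generate(p) for p in prompts]
    with_image = [generate(p, image=img) for p, img in zip(prompts, images)]
    print("English replies, text only: ", english_rate(text_only))
    print("English replies, with image:", english_rate(with_image))

# Toy stub so the sketch runs end to end; swap in a real VLM call.
def fake_generate(prompt, image=None):
    return "The picture shows a dog." if image is not None else "Das Bild zeigt einen Hund."

run_probe(fake_generate, prompts=["Was ist auf dem Bild zu sehen?"] * 5, images=[object()] * 5)
```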
arXiv Detail & Related papers (2024-07-02T15:01:55Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Visually Grounded Speech Models have a Mutual Exclusivity Bias [20.495178526318185]
When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias.
This bias has been studied computationally, but only in models that use discrete word representations as input.
We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio.
arXiv Detail & Related papers (2024-03-20T18:49:59Z)
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but that, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Like a bilingual baby: The advantage of visually grounding a bilingual language model [0.0]
We train an LSTM language model on images and captions in English and Spanish from MS-COCO-ES.
We find that the visual grounding improves the model's understanding of semantic similarity both within and across languages and improves perplexity.
Our results provide additional evidence of the advantages of visually grounded language models and point to the need for more naturalistic language data from multilingual speakers and multilingual datasets with perceptual grounding.
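A minimal sketch of one way to ground an LSTM language model visually, initialising the recurrent state from an image feature vector; the dimensions and wiring are illustrative assumptions and need not match the paper's model.

```python
# Minimal sketch of a visually grounded LSTM language model: the image feature
# initialises the LSTM state, and the model predicts the next token of an
# English or Spanish caption.
import torch
import torch.nn as nn

class GroundedLSTMLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hid_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_to_h0 = nn.Linear(img_dim, hid_dim)   # ground the LM in the image
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, image_feats):
        h0 = torch.tanh(self.img_to_h0(image_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        x = self.embed(tokens)                                     # (B, T, E)
        h, _ = self.lstm(x, (h0, c0))
        return self.out(h)                                         # next-token logits

# Toy forward pass: batch of 4 captions, 12 tokens each, with CNN image features.
model = GroundedLSTMLM()
logits = model(torch.randint(0, 10000, (4, 12)), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 12, 10000])
```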
arXiv Detail & Related papers (2022-10-11T14:43:26Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
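The clustering step could look roughly like the sketch below, which groups per-language representation vectors with k-means; the vectors are random stand-ins for embeddings pooled from a multilingual encoder, and the language list is illustrative.

```python
# Sketch of grouping languages into "representation sprachbunds" with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
languages = ["en", "fr", "nl", "de", "es", "ro", "ru", "uk", "hi", "ur"]
lang_vecs = rng.normal(size=(len(languages), 768))  # placeholder language embeddings

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lang_vecs)
for cluster_id in range(3):
    members = [lang for lang, c in zip(languages, kmeans.labels_) if c == cluster_id]
    print(f"sprachbund {cluster_id}: {members}")
```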
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
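An InfoNCE-style cross-lingual contrastive loss over parallel sentence pairs, sketched below, conveys the flavour of such a pre-training task; InfoXLM's actual objective (cross-lingual contrast with a momentum encoder, alongside other tasks) is more involved.

```python
# Hedged sketch of a cross-lingual contrastive objective: a sentence and its
# translation are a positive pair, other sentences in the batch are negatives.
import torch
import torch.nn.functional as F

def xl_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """src_emb, tgt_emb: (B, D) sentence embeddings of aligned translation pairs."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(src.size(0))     # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

loss = xl_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```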
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
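The masked contrastive task that wav2vec 2.0 (and hence XLSR) trains on can be sketched roughly as follows: at each masked time step, the context representation must identify the true latent among distractors sampled from the same utterance. Shapes, the number of distractors, and the temperature are illustrative, not wav2vec 2.0's exact settings.

```python
# Sketch of a wav2vec 2.0-style contrastive task over masked latent speech frames.
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context_out, latents, masked_idx, num_distractors=10, temp=0.1):
    """context_out, latents: (T, D); masked_idx: indices of masked time steps."""
    losses = []
    T = latents.size(0)
    for t in masked_idx:
        # Candidate set: the true latent plus distractors from other time steps.
        others = torch.tensor([i for i in range(T) if i != t])
        distractors = others[torch.randperm(len(others))[:num_distractors]]
        candidates = torch.cat([latents[t:t + 1], latents[distractors]])  # (K+1, D)
        sims = F.cosine_similarity(context_out[t].unsqueeze(0), candidates) / temp
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

loss = masked_contrastive_loss(torch.randn(50, 256), torch.randn(50, 256), masked_idx=[3, 10, 27])
print(loss.item())
```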
arXiv Detail & Related papers (2020-06-24T18:25:05Z)