Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
- URL: http://arxiv.org/abs/2409.02865v1
- Date: Tue, 3 Sep 2024 17:59:50 GMT
- Title: Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
- Authors: Leanne Nortje
- Abstract summary: We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images.
We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba.
- Score: 4.340338299803563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and on understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we find that multilingualism does not affect the bias in this VGS model in the same way it does in children.
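To make the visually prompted keyword localisation task concrete, the sketch below compares an image (the visual prompt) against per-frame speech embeddings: the maximum similarity gives a detection score and its position gives the localisation. This is a minimal sketch under assumed stand-in encoders (`image_encoder` and `speech_encoder` are hypothetical placeholders), not the dissertation's actual model.

```python
# Minimal sketch of visually prompted keyword localisation (illustrative only).
# The image acts as the query: its embedding is compared with per-frame speech
# embeddings; the max similarity is the detection score and its argmax the
# localisation.  `image_encoder` / `speech_encoder` are hypothetical stand-ins
# for any pretrained visual / speech encoders.
import torch
import torch.nn.functional as F

def localise_keyword(image, speech, image_encoder, speech_encoder, threshold=0.5):
    q = F.normalize(image_encoder(image), dim=-1)         # (D,) image-query embedding
    frames = F.normalize(speech_encoder(speech), dim=-1)  # (T, D) per-frame speech embeddings
    sims = frames @ q                                      # (T,) cosine similarity per frame
    score = sims.max().item()                              # utterance-level detection score
    frame_idx = int(sims.argmax())                         # frame where the keyword most likely occurs
    return score >= threshold, score, frame_idx

if __name__ == "__main__":
    # Random stand-in encoders just to show the call pattern.
    D = 512
    image_encoder = lambda img: torch.randn(D)
    speech_encoder = lambda wav: torch.randn(200, D)
    print(localise_keyword(torch.zeros(3, 224, 224), torch.zeros(16000),
                           image_encoder, speech_encoder))
```

In a real system the returned frame index would be mapped back to a time span in the utterance (via the speech encoder's frame rate) to obtain the localisation.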
Related papers
- The mutual exclusivity bias of bilingual visually grounded speech models [22.97008687596735]
Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one.
Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images.
We explore this pattern using bilingual VGS models trained on combinations of English, French, and Dutch; a toy sketch of an ME evaluation appears after this list.
arXiv Detail & Related papers (2025-06-04T14:59:22Z) - Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning [1.532756501930393]
We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages.
Our approach integrates representations within the Model-Agnostic Meta-Learning framework to classify abusive language in 10 languages.
arXiv Detail & Related papers (2024-12-02T11:51:19Z) - Multilingual acoustic word embeddings for zero-resource languages [1.5229257192293204]
It specifically uses acoustic word embedding (AWE) -- fixed-dimensional representations of variable-duration speech segments.
The study introduces a new neural network that outperforms existing AWE models on zero-resource languages.
AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts.
arXiv Detail & Related papers (2024-01-19T08:02:37Z) - Pixel Aligned Language Models [94.32841818609914]
We develop a vision-language model that can take locations as either inputs or outputs.
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention.
arXiv Detail & Related papers (2023-12-14T18:57:58Z) - Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z) - Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
However, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models.
We make visual information accessible to the language model using separate verbalisation models.
arXiv Detail & Related papers (2023-05-23T07:50:36Z) - Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z) - YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding [21.51901080054713]
We release a new dataset of audio captions for 6k Flickr images in Yorùbá, a real low-resource language spoken in Nigeria.
We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances.
This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech.
arXiv Detail & Related papers (2022-10-10T11:58:10Z) - Testing the Ability of Language Models to Interpret Figurative Language [69.59943454934799]
Figurative and metaphorical language are commonplace in discourse.
It remains an open question to what extent modern language models can interpret nonliteral phrases.
We introduce Fig-QA, a Winograd-style nonliteral language understanding task.
arXiv Detail & Related papers (2022-04-26T23:42:22Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
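To make the mutual exclusivity bias mentioned in the abstract and in the first related paper above more concrete, the toy sketch below shows one way such an evaluation could be run: a VGS model's audio-image similarity lets a novel spoken word choose between a familiar and a novel image, and the ME choice rate is how often the novel image wins. The `score` function is a hypothetical placeholder for any trained model's similarity; this is not the exact protocol of the cited papers.

```python
# Toy mutual-exclusivity (ME) evaluation sketch (illustrative only).
# A novel spoken word is scored against a familiar image and a novel image;
# choosing the novel image more often than chance indicates an ME bias.
# `score(audio, image)` is a hypothetical similarity function from a trained
# visually grounded speech model.
import random

def me_choice_rate(novel_words, familiar_images, novel_images, score):
    """Fraction of trials in which the novel word picks the novel image."""
    hits = 0
    for word in novel_words:
        fam = random.choice(familiar_images)
        nov = random.choice(novel_images)
        hits += score(word, nov) > score(word, fam)
    return hits / len(novel_words)

if __name__ == "__main__":
    # A random scorer should give roughly 0.5 (no bias); a VGS model with an
    # ME bias would give a rate well above 0.5.
    rnd_score = lambda audio, image: random.random()
    rate = me_choice_rate(list(range(1000)), ["fam_a", "fam_b"], ["nov_a", "nov_b"], rnd_score)
    print(f"ME choice rate: {rate:.2f}")
```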
This list is automatically generated from the titles and abstracts of the papers on this site.