Visually Grounded Keyword Detection and Localisation for Low-Resource
Languages
- URL: http://arxiv.org/abs/2302.00765v1
- Date: Wed, 1 Feb 2023 21:32:15 GMT
- Title: Visually Grounded Keyword Detection and Localisation for Low-Resource
Languages
- Authors: Kayode Kolawole Olaleye
- Abstract summary: The study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech.
Four methods for localisation are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%.
A new dataset containing spoken captions in the Yoruba language is also collected and released for cross-lingual keyword localisation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the use of Visually Grounded Speech (VGS) models for
keyword localisation in speech. The study focusses on two main research
questions: (1) is keyword localisation possible with VGS models, and (2) can
keyword localisation be done cross-lingually in a real low-resource setting?
Four methods for localisation are proposed and evaluated on an English dataset,
with the best-performing method achieving an accuracy of 57%. A new dataset
containing spoken captions in the Yoruba language is also collected and released
for cross-lingual keyword localisation. The cross-lingual model obtains a
precision of 16% in actual keyword localisation and this performance can be
improved by initialising from a model pretrained on English data. The study
presents a detailed analysis of the model's success and failure modes and
highlights the challenges of using VGS models for keyword localisation in
low-resource settings.
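The detection-and-localisation setup studied in the thesis can be pictured with a toy sketch. The code below is purely illustrative: random vectors stand in for a trained VGS model's outputs, and the per-frame embeddings, keyword embedding, and threshold are all hypothetical. A keyword is detected if any speech frame embedding is sufficiently similar to the keyword embedding, and localised at the highest-scoring frame.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, dim = 50, 32

# Hypothetical per-frame speech embeddings from a VGS-style encoder.
frames = rng.standard_normal((n_frames, dim))
# Hypothetical embedding of the query keyword.
keyword = rng.standard_normal(dim)

def localise(frames, keyword, threshold=0.5):
    """Score each frame against the keyword by cosine similarity;
    detect if the best score clears the threshold, localise at argmax."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    k = keyword / np.linalg.norm(keyword)
    scores = f @ k                        # (n_frames,) similarities
    detected = bool(scores.max() > threshold)
    location = int(scores.argmax())       # frame index of best match
    return detected, location, scores

detected, location, scores = localise(frames, keyword)
```

In the actual work, the frame scores would come from a model trained on image-caption pairs rather than from random vectors; this sketch only shows the scoring-and-argmax step shared by attention- and similarity-style localisation methods.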
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Hindi as a Second Language: Improving Visually Grounded Speech with
Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- Locale Encoding For Scalable Multilingual Keyword Spotting Models [8.385848547707953]
We propose two locale-conditioned universal models with locale feature concatenation and feature-wise linear modulation (FiLM).
FiLM performed the best, improving the average false reject rate (FRR) by 61% (relative) compared to monolingual KWS models of similar sizes.
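FiLM itself is a simple, well-known mechanism: a conditioning vector (here, a locale embedding) produces a per-channel scale and shift applied to the features. A minimal sketch with random stand-in weights follows; the dimensions, embedding table, and projection matrices are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
n_frames, feat_dim, n_locales, locale_dim = 20, 16, 4, 8

# Acoustic features for one utterance: (frames, channels).
features = rng.standard_normal((n_frames, feat_dim))

# Locale embedding table and projections to per-channel scale/shift.
locale_embedding = rng.standard_normal((n_locales, locale_dim))
W_gamma = rng.standard_normal((locale_dim, feat_dim))
W_beta = rng.standard_normal((locale_dim, feat_dim))

def film(features, locale_id):
    """Feature-wise linear modulation: scale and shift each channel
    with locale-conditioned parameters, broadcast over frames."""
    z = locale_embedding[locale_id]   # (locale_dim,)
    gamma = z @ W_gamma               # (feat_dim,) per-channel scale
    beta = z @ W_beta                 # (feat_dim,) per-channel shift
    return gamma * features + beta

out = film(features, locale_id=2)
print(out.shape)  # (20, 16)
```

Because only the scale and shift depend on the locale, one shared backbone can serve all locales, which is what makes the approach scale to multilingual KWS.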
arXiv Detail & Related papers (2023-02-25T02:20:59Z)
- Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better
Than Unsupervised? [8.594972401685649]
We study whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages.
The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages.
We find that pretrained models fine-tuned on a multilingual corpus covering languages that do not appear in the test set consistently outperform unsupervised models in all six languages.
arXiv Detail & Related papers (2022-02-14T12:06:45Z)
- Keyword localisation in untranscribed speech using visually grounded
speech models [21.51901080054713]
Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs.
VGS models are trained on unlabelled images paired with spoken captions.
Masked-based localisation gives some of the best reported localisation scores from a VGS model.
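Masked-based localisation can be pictured as an occlusion test: mask each window of frames in turn and take the window whose removal hurts the utterance-level detection score most as the keyword's location. The sketch below uses a stand-in detector (cosine similarity against a hypothetical keyword embedding) rather than the trained VGS model, so only the masking logic is meant to be faithful.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, dim = 40, 16
frames = rng.standard_normal((n_frames, dim))   # stand-in frame embeddings
keyword = rng.standard_normal(dim)              # stand-in keyword embedding

def detection_score(frames, keyword):
    """Stand-in utterance-level detector: best frame/keyword similarity."""
    f = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    k = keyword / np.linalg.norm(keyword)
    return float((f @ k).max())

def masked_localise(frames, keyword, window=5):
    """Zero out each window of frames; the window whose removal causes
    the largest score drop is taken as the keyword location."""
    base = detection_score(frames, keyword)
    drops = []
    for start in range(len(frames) - window + 1):
        masked = frames.copy()
        masked[start:start + window] = 0.0
        drops.append(base - detection_score(masked, keyword))
    best = int(np.argmax(drops))
    return best, best + window

start, end = masked_localise(frames, keyword)
```

The window size here is arbitrary; in practice it would be tied to expected keyword duration.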
arXiv Detail & Related papers (2022-02-02T16:14:29Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- From Masked Language Modeling to Translation: Non-English Auxiliary
Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect.
We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer.
Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters.
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.