YFACC: A Yorùbá speech-image dataset for cross-lingual keyword
localisation through visual grounding
- URL: http://arxiv.org/abs/2210.04600v2
- Date: Wed, 12 Oct 2022 07:55:39 GMT
- Title: YFACC: A Yorùbá speech-image dataset for cross-lingual keyword
localisation through visual grounding
- Authors: Kayode Olaleye, Dan Oneata, Herman Kamper
- Abstract summary: We release a new dataset of audio captions for 6k Flickr images in Yorùbá -- a real low-resource language spoken in Nigeria.
We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances.
This enables cross-lingual keyword localisation: a written English query is detected and located in Yorub'a speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visually grounded speech (VGS) models are trained on images paired with
unlabelled spoken captions. Such models could be used to build speech systems
in settings where it is impossible to get labelled data, e.g. for documenting
unwritten languages. However, most VGS studies are in English or other
high-resource languages. This paper attempts to address this shortcoming. We
collect and release a new single-speaker dataset of audio captions for 6k
Flickr images in Yorùbá -- a real low-resource language spoken in Nigeria.
We train an attention-based VGS model where images are automatically tagged
with English visual labels and paired with Yorùbá utterances. This enables
cross-lingual keyword localisation: a written English query is detected and
located in Yorùbá speech. To quantify the effect of the smaller dataset, we
compare to English systems trained on similar and more data. We hope that this
new dataset will stimulate research in the use of VGS models for real
low-resource languages.
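As a rough illustration of the cross-lingual keyword localisation described in the abstract, the sketch below scores a written English query against the frames of a Yorùbá utterance with dot-product attention. This is a minimal, hypothetical sketch: it assumes a query embedding and frame-level speech embeddings are already available, and the pooling and thresholding choices are placeholders, not the paper's actual architecture.

```python
import numpy as np

def localise_keyword(query_emb, speech_frames, threshold=0.5):
    """Hypothetical sketch: detect and locate a written query in speech.

    query_emb:     (d,) embedding of an English keyword (assumed given,
                   e.g. from the visual tagger's label vocabulary).
    speech_frames: (T, d) frame-level embeddings of a Yorùbá utterance
                   from a speech encoder (assumed given).
    """
    # Attention scores between the query and every speech frame.
    scores = speech_frames @ query_emb              # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over time

    # Detection: an attention-pooled score says whether the keyword occurs.
    detection_score = float(weights @ scores)
    detected = detection_score > threshold

    # Localisation: the frame with the highest attention weight.
    location = int(np.argmax(weights))
    return detected, location
```

In the actual model the speech encoder is trained only from image-speech pairs, so no Yorùbá transcriptions are needed at any point.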
Related papers
- Translating speech with just images [23.104041372055466]
We extend this connection by linking images to text via an existing image captioning system.
This approach can be used for speech translation with just images by having the audio in a different language from the generated captions.
We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model.
arXiv Detail & Related papers (2024-06-11T10:29:24Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens [87.52235889917223]
We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases.
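The "discretised speech units" mentioned above are, in general, obtained by mapping each continuous frame feature to the index of its nearest cluster centroid. The sketch below illustrates that quantisation step only; it is a hypothetical toy, and in practice the codebook would come from k-means over self-supervised speech features rather than being handed in directly.

```python
import numpy as np

def quantise_to_units(features, codebook):
    """Hypothetical sketch of discretised speech units.

    features: (T, d) continuous frame features (assumed given).
    codebook: (K, d) cluster centroids (assumed given; typically from
              k-means over self-supervised speech representations).
    """
    # Squared distances from every frame to every centroid: (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    units = dists.argmin(axis=1)            # (T,) discrete unit IDs

    # Collapse consecutive repeats, as unit sequences often are deduplicated.
    dedup = [int(units[0])] + [int(u) for i, u in enumerate(units[1:], 1)
                               if units[i] != units[i - 1]]
    return units, dedup
```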
arXiv Detail & Related papers (2023-09-15T16:48:34Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models with more data outperform monolingual ones but that, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- "Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks [20.837515947519524]
The first sentences and titles of linked Wikipedia pages, together with cross-lingual image captions, provide strong seed parallel data for extracting bilingual dictionaries and cross-lingual word embeddings, which are then used to mine parallel text from Wikipedia.
In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English from which the Arabic training data is a wikily translation of the English captioning data.
Our captioning results in Arabic are slightly better than those of the supervised model.
arXiv Detail & Related papers (2021-04-16T21:49:12Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
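Joint training of acoustic word embeddings (AWEs, from audio) and acoustically grounded word embeddings (AGWEs, from written words) is typically driven by a contrastive objective that pulls matched audio/text pairs together and pushes mismatched pairs apart. The margin loss below is a minimal, hypothetical sketch of that idea, not the paper's exact objective; it assumes both embedding matrices are already computed and L2-normalised.

```python
import numpy as np

def contrastive_awe_agwe_loss(awe, agwe, margin=0.4):
    """Hypothetical margin-based loss for joint AWE/AGWE training.

    awe, agwe: (N, d) L2-normalised embeddings; row i of each matrix
               corresponds to the same word (assumed given).
    """
    sim = awe @ agwe.T                      # (N, N) cosine similarities
    pos = np.diag(sim)                      # matched audio/text pairs
    # Penalise any negative that comes within `margin` of its positive.
    loss = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(loss, 0.0)             # exclude the positives themselves
    return float(loss.mean())
```

Minimising such a loss over multiple languages at once is what lets the shared embedding space transfer to unseen zero-resource languages.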
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage [23.71195344840051]
Cross-modal language generation tasks such as image captioning are directly hurt by the trend of data-hungry models combined with the lack of non-English annotations.
We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages directly at training time both existing English annotations and their machine-translated versions.
We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages.
arXiv Detail & Related papers (2020-05-01T06:58:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.