BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
- URL: http://arxiv.org/abs/2305.17202v1
- Date: Fri, 26 May 2023 18:49:55 GMT
- Title: BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
- Authors: Claytone Sikasote, Eunice Mukonde, Md Mahfuz Ibn Alam, Antonios Anastasopoulos
- Abstract summary: The dataset is comprised of multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English.
There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BIG-C (Bemba Image Grounded Conversations), a large multimodal
dataset for Bemba. While Bemba is the most widely spoken language of Zambia, it
exhibits a dearth of resources, which renders the development of language
technologies or language processing research almost impossible. The dataset is
comprised of multi-turn dialogues between Bemba speakers based on images,
transcribed and translated into English. There are more than 92,000
utterances/sentences, amounting to more than 180 hours of audio data with
corresponding transcriptions and English translations. We also provide
baselines on speech recognition (ASR), machine translation (MT) and speech
translation (ST) tasks, and sketch out other potential future multimodal uses
of our dataset. We hope that by making the dataset available to the research
community, this work will foster research and encourage collaboration across
the language, speech, and vision communities, especially for languages outside
the "traditionally" used high-resourced ones. All data and code are publicly
available: https://github.com/csikasote/bigc.
Related papers
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language which is spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- MLS: A Large-Scale Multilingual Dataset for Speech Research [37.803100082550294]
The dataset is derived from read audiobooks from LibriVox.
It consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages.
arXiv Detail & Related papers (2020-12-07T01:53:45Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- CoVoST 2 and Massively Multilingual Speech-to-Text Translation [24.904548615918355]
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages.
This represents the largest open dataset available to date in terms of total volume and language coverage.
arXiv Detail & Related papers (2020-07-20T17:53:35Z)