TaTa: A Multilingual Table-to-Text Dataset for African Languages
- URL: http://arxiv.org/abs/2211.00142v1
- Date: Mon, 31 Oct 2022 21:05:42 GMT
- Title: TaTa: A Multilingual Table-to-Text Dataset for African Languages
- Authors: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha,
Michael Chavinda, Ankur Parikh, Clara Rivera
- Abstract summary: Table-to-Text in African languages (TaTa) is the first large multilingual table-to-text dataset with a focus on African languages.
TaTa includes 8,700 examples in nine languages, including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).
- Score: 32.348630887289524
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Existing data-to-text generation datasets are mostly limited to English. To
address this lack of data, we create Table-to-Text in African languages (TaTa),
the first large multilingual table-to-text dataset with a focus on African
languages. We created TaTa by transcribing figures and accompanying text in
bilingual reports by the Demographic and Health Surveys Program, followed by
professional translation to make the dataset fully parallel. TaTa includes
8,700 examples in nine languages including four African languages (Hausa, Igbo,
Swahili, and Yorùbá) and a zero-shot test language (Russian). We
additionally release screenshots of the original figures for future research on
multilingual multi-modal approaches. Through an in-depth human evaluation, we
show that TaTa is challenging for current models and that less than half the
outputs from an mT5-XXL-based model are understandable and attributable to the
source data. We further demonstrate that existing metrics perform poorly for
TaTa and introduce learned metrics that achieve a high correlation with human
judgments. We release all data and annotations at
https://github.com/google-research/url-nlp.
Related papers
- Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with
Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
- XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages [11.581072296148031]
We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset.
Our experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to best results on average across the twelve languages.
arXiv Detail & Related papers (2022-09-22T18:01:27Z)
- MLS: A Large-Scale Multilingual Dataset for Speech Research [37.803100082550294]
The dataset is derived from read audiobooks from LibriVox.
It consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages.
arXiv Detail & Related papers (2020-12-07T01:53:45Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)