SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic
Classification in 200+ Languages and Dialects
- URL: http://arxiv.org/abs/2309.07445v3
- Date: Thu, 7 Mar 2024 13:16:08 GMT
- Title: SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic
Classification in 200+ Languages and Dialects
- Authors: David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev,
Jesujoba O. Alabi, Yanke Mao, Haonan Gao, Annie En-Shiun Lee
- Abstract summary: We created SIB-200 -- a large-scale benchmark dataset for topic classification in 200 languages and dialects.
For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for Natural Language Understanding.
We found that languages unseen during the pre-training of multilingual language models, under-represented language families, and languages from the regions of Africa, the Americas, Oceania and South East Asia often have the lowest performance on our topic classification dataset.
- Score: 9.501383449039142
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the progress recorded in multilingual natural language
processing over the last few years, evaluation is typically limited to the
small set of languages with available datasets, which excludes a large number
of low-resource languages. In this paper, we created SIB-200 -- a large-scale,
open-source benchmark dataset for topic classification in 200 languages and
dialects -- to address the lack of evaluation datasets for Natural Language
Understanding (NLU). For many of the languages covered in SIB-200, this is the
first publicly available evaluation dataset for NLU. The dataset is based on
the Flores-200 machine translation corpus. We annotated the English portion of
the dataset and extended the sentence-level annotation to the remaining 203
languages covered in the corpus. Despite the simplicity of this task, our
evaluations in the fully supervised setting, the cross-lingual transfer
setting, and the prompting of large language models show that there is still a
large gap between the performance of high-resource and low-resource languages
when multilingual evaluation is scaled to numerous world languages. We found
that languages unseen during the pre-training of multilingual language models,
under-represented language families (such as Nilotic and Atlantic-Congo), and
languages from the regions of Africa, the Americas, Oceania and South East
Asia often have the lowest performance on our topic classification dataset. We
hope our dataset will encourage a more inclusive evaluation of multilingual
language models on a more diverse set of languages. https://github.com/dadelani/sib-200
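To make the fully supervised setting concrete, here is a minimal fine-tuning sketch. The dataset identifier Davlan/sib200, the per-language config names such as eng_Latn, the text/category column names, and the split names are assumptions taken from the public release linked above, not details given in the abstract.

```python
# Minimal sketch of the fully supervised setting. Assumed (not stated in the
# abstract): the dataset is hosted on the Hugging Face Hub as "Davlan/sib200"
# with one config per language (e.g. "eng_Latn"), "text" and "category"
# columns, and train/validation/test splits.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("Davlan/sib200", "eng_Latn")   # one of the 200+ configs
labels = sorted(set(ds["train"]["category"]))    # topic labels (7 in the public release)
label2id = {label: i for i, label in enumerate(labels)}

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

def preprocess(batch):
    enc = tok(batch["text"], truncation=True, max_length=128)
    enc["labels"] = [label2id[c] for c in batch["category"]]
    return enc

# Drop the raw string columns so the default collator only sees tensors.
ds = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments("sib200-eng_Latn", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,                               # enables dynamic padding
)
trainer.train()
print(trainer.evaluate(ds["test"]))
```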
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
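A rough sketch of what lexicon-based pretraining could look like (not the authors' implementation); the lexicon entries and model choice below are hypothetical stand-ins:

```python
# Sketch: treat (word, polarity) entries from sentiment lexicons in many
# languages as tiny labelled examples for a multilingual encoder, then apply
# the resulting classifier zero-shot to full sentences. The lexicon below is
# a hypothetical stand-in, not data from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

lexicon = [            # hypothetical multilingual entries: (word, label)
    ("excellent", 1), ("terrible", 0),   # English
    ("magnifique", 1), ("affreux", 0),   # French
]

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)
model.train()
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

for word, label in lexicon:              # one tiny "sentence" per lexicon entry
    enc = tok(word, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([label])).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
# After pretraining on lexicon entries alone, the classifier is evaluated
# zero-shot on sentence-level sentiment in unseen languages.
```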
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
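A rough sketch of the gradient-similarity idea (not the authors' implementation): estimate one loss gradient per language on a shared multilingual model, then compare languages by the cosine similarity of those gradients. The model, loss, and batch interfaces below are generic placeholders.

```python
# Sketch of gradient-based language grouping: languages whose loss gradients
# point in similar directions are grouped for joint training.
import torch
import torch.nn.functional as F

def language_gradient(model, loss_fn, inputs, labels):
    """Flattened loss gradient for one language's batch."""
    model.zero_grad()
    loss_fn(model(inputs), labels).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()
                      if p.grad is not None]).detach().clone()

def gradient_similarity(model, loss_fn, batches_by_lang):
    """Pairwise cosine similarity between per-language gradients."""
    langs = list(batches_by_lang)
    grads = [language_gradient(model, loss_fn, *batches_by_lang[lang])
             for lang in langs]
    sim = torch.stack([torch.stack([F.cosine_similarity(g1, g2, dim=0)
                                    for g2 in grads])
                       for g1 in grads])
    return langs, sim   # cluster rows of `sim` to form language groups
```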
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
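A minimal sketch of this projection step, assuming a simple verse-keyed data layout (the dictionary structure below is hypothetical, not the paper's format):

```python
# Sketch of annotation projection through verse alignment: Bible translations
# share verse IDs, so a label assigned to an English verse carries over to
# the same verse in every other language.
def project_labels(english_labels, corpora_by_lang):
    """english_labels: {verse_id: topic_label}
    corpora_by_lang: {lang: {verse_id: verse_text}}
    Returns per-language lists of (text, label) pairs."""
    return {
        lang: [(text, english_labels[vid])
               for vid, text in verses.items() if vid in english_labels]
        for lang, verses in corpora_by_lang.items()
    }
```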
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages [0.0]
We show how a multilingual large language model can be a resource for sentiment analysis in languages not seen during pretraining.
The languages are to various degrees related to languages used during pretraining, and the language data contain various degrees of code-switching.
We experiment with both monolingual and multilingual datasets for the final fine-tuning, and find that with the provided datasets that contain samples in the thousands, monolingual fine-tuning yields the best results.
arXiv Detail & Related papers (2023-04-27T13:51:18Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)