Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages
- URL: http://arxiv.org/abs/2305.08487v2
- Date: Tue, 4 Jun 2024 15:03:12 GMT
- Title: Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages
- Authors: Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schütze
- Abstract summary: We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
- Score: 40.01333053375582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
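The label projection step described in the abstract can be illustrated with a minimal sketch. It assumes each translation is available as a mapping from a verse identifier to its text and that the English annotations map the same identifiers to a topic label. The function name `project_labels`, the verse IDs, the label names, and the placeholder texts are all hypothetical and are not taken from the released dataset or code.

```python
from typing import Dict, List, Tuple


def project_labels(
    english_labels: Dict[str, str],
    target_verses: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Copy each English verse label onto the aligned verse in another language.

    Returns (verse_id, target_text, label) triples for every verse ID that is
    annotated on the English side and present in the target translation.
    """
    projected = []
    for verse_id, label in english_labels.items():
        text = target_verses.get(verse_id)
        if text:  # skip verses that this translation does not contain
            projected.append((verse_id, text, label))
    return projected


# Hypothetical annotations on the English side: verse ID -> topic label.
english_labels = {"v1": "topic_a", "v2": "topic_b", "v3": "topic_a"}

# Hypothetical target-language translation that lacks verse v2.
target_verses = {"v1": "<target text of v1>", "v3": "<target text of v3>"}

dataset = project_labels(english_labels, target_verses)
for verse_id, text, label in dataset:
    print(verse_id, label, text)
```

Because alignment here is only a lookup on a shared verse ID, the projected dataset for each language is limited to the annotated verses that its translation actually contains.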
Related papers
- Universal Cross-Lingual Text Classification [0.3958317527488535]
This research proposes a novel perspective on Universal Cross-Lingual Text Classification.
Our approach involves blending supervised data from different languages during training to create a universal model.
The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages.
arXiv Detail & Related papers (2024-06-16T17:58:29Z) - IndicSTR12: A Dataset for Indic Scene Text Recognition [33.194567434881314]
This paper proposes the largest and most comprehensive real dataset - IndicSTR12 - and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27000 word-images gathered from various natural scenes, with over 1000 word-images for each language.
arXiv Detail & Related papers (2024-03-12T18:14:48Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects [9.501383449039142]
We created SIB-200 -- a large-scale benchmark dataset for topic classification in 200 languages and dialects.
For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for Natural Language Understanding.
We found that languages unseen during the pre-training of multilingual language models, under-represented language families, and languages from the regions of Africa, Americas, Oceania and South East Asia often have the lowest performance on our topic classification dataset.
arXiv Detail & Related papers (2023-09-14T05:56:49Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - TaTa: A Multilingual Table-to-Text Dataset for African Languages [32.348630887289524]
Table-to-Text in African languages (TaTa) is the first large multilingual table-to-text dataset with a focus on African languages.
TaTa includes 8,700 examples in nine languages, including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).
arXiv Detail & Related papers (2022-10-31T21:05:42Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z) - A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)