Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of
Downstream Tasks
- URL: http://arxiv.org/abs/2210.14712v1
- Date: Wed, 26 Oct 2022 13:45:14 GMT
- Authors: Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham
Owodunni, and Daniel Whitenack
- Abstract summary: In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families.
Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Bloom Library, a linguistically diverse set of multimodal and
multilingual datasets for language modeling, image captioning, visual
storytelling, and speech synthesis/recognition. These datasets represent either
the most, or among the most, multilingual datasets for each of the included
downstream tasks. In total, the initial release of the Bloom Library datasets
covers 363 languages across 32 language families. We train downstream task
models for various languages represented in the data, showing the viability of
the data for future work in low-resource, multimodal NLP and establishing the
first known baselines for these downstream tasks in certain languages (e.g.,
Bisu [bzi], with an estimated population of 700 users). Some of these
first-of-their-kind baselines are comparable to state-of-the-art performance
for higher-resourced languages. The Bloom Library datasets are released under
Creative Commons licenses on the Hugging Face datasets hub to catalyze more
linguistically diverse research in the included downstream tasks.
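Since the datasets are hosted on the Hugging Face hub, they can be pulled directly with the `datasets` library. The sketch below is illustrative only: the repository id `sil-ai/bloom-lm` and the `bzi` (Bisu) config name are assumptions, not details confirmed by the abstract.

```python
# Minimal sketch of loading a Bloom Library dataset from the Hugging Face hub.
# Assumptions: the repo id "sil-ai/bloom-lm" and the language-code config
# "bzi" (Bisu) are illustrative; check the hub for the exact identifiers.
from datasets import load_dataset

bloom_lm = load_dataset("sil-ai/bloom-lm", "bzi")
print(bloom_lm)              # available splits and their sizes
print(bloom_lm["train"][0])  # one raw text example
```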
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
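A rough sketch of the general idea (not the paper's exact recipe): lexicon entries alone can supply word-level training examples, so no sentence-level sentiment labels are needed. All lexicon entries below are toy placeholders.

```python
# Toy sketch: build word-level sentiment training examples from multilingual
# lexicons, with no sentence-level labels involved. Entries are placeholders.
lexicons = {
    "eng": {"good": "positive", "awful": "negative"},
    "swh": {"nzuri": "positive", "mbaya": "negative"},  # hypothetical entries
}

train_examples = [
    {"text": word, "label": label, "lang": lang}
    for lang, entries in lexicons.items()
    for word, label in entries.items()
]
print(train_examples[:2])
# These examples could pretrain any multilingual classifier, which is then
# applied zero-shot to full sentences at test time.
```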
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
We created SIB-200 -- a large-scale benchmark dataset for topic classification in 200 languages and dialects.
For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for Natural Language Understanding.
We found that languages unseen during the pre-training of multilingual language models, under-represented language families, and languages from the regions of Africa, Americas, Oceania and South East Asia often have the lowest performance on our topic classification dataset.
arXiv Detail & Related papers (2023-09-14T05:56:49Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- PolyLM: An Open Source Polyglot Large Language Model
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
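The curriculum in point 2 can be pictured as a schedule over pre-training stages. The linear interpolation and stage count below are assumptions for illustration; the paper defines its own stage boundaries.

```python
# Toy sketch of a data-mixing curriculum: the non-English share of the
# pre-training mix rises from 30% at the first stage to 60% at the last.
# The linear schedule and the four stages are assumptions for illustration.
def non_english_fraction(stage: int, num_stages: int,
                         start: float = 0.30, end: float = 0.60) -> float:
    """Return the non-English data fraction to sample at a given stage."""
    if num_stages <= 1:
        return end
    t = stage / (num_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    return start + t * (end - start)

for stage in range(4):
    print(f"stage {stage}: {non_english_fraction(stage, 4):.0%} non-English")
```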
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Large Scale Multi-Lingual Multi-Modal Summarization Dataset
We present the current largest multi-lingual multi-modal summarization dataset (M3LS).
It consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair.
It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages.
arXiv Detail & Related papers (2023-02-13T18:00:23Z)
- EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain
We propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex).
Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages.
We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
arXiv Detail & Related papers (2022-10-24T17:58:59Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
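A toy sketch in the spirit of this augmentation (the dictionaries are placeholders, not the paper's resources): each word is independently switched into a randomly chosen target language when a bilingual dictionary offers a translation.

```python
import random

# Toy code-switching augmentation: for each token, pick a random target
# language and, with some probability, swap in its dictionary translation.
# The tiny dictionaries below are placeholders, not the paper's resources.
dictionaries = {
    "de": {"i": "ich", "like": "mag", "music": "Musik"},
    "es": {"i": "yo", "like": "adoro", "music": "música"},
}

def code_switch(tokens, dictionaries, replace_prob=0.5):
    switched = []
    for tok in tokens:
        lang = random.choice(list(dictionaries))       # random target language
        translation = dictionaries[lang].get(tok.lower())
        if translation is not None and random.random() < replace_prob:
            switched.append(translation)               # replace the word
        else:
            switched.append(tok)                       # keep the original
    return switched

print(code_switch("I like music".split(), dictionaries))
```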
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets (one for each unordered pair of the 12 languages: 12 × 11 / 2 = 66).
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.