Related papers: The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

URL: http://arxiv.org/abs/2407.17479v1
Date: Mon, 1 Jul 2024 23:01:41 GMT
Title: The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain
Authors: María Grandury
Abstract summary: We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean and Spain needs to be represented in Artificial Intelligence (AI) systems. Despite being the 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs). We present how we have created as an international open-source community the first versions of the instruction and evaluation datasets.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean and Spain needs to be represented in Artificial Intelligence (AI) systems. Despite being the 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how we have created as an international open-source community the first versions of the instruction and evaluation datasets, indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.

Related papers

La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America [33.48097838499165]
We present La Leaderboard, the first open-source leaderboard to evaluate generative Large Language Models. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties. We explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task.
arXiv Detail & Related papers (2025-07-01T17:50:48Z)
FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models [1.2403152094314245]
We introduce FORMOSANBENCH, the first benchmark for evaluating large language models (LLMs) on low-resource Austronesian languages. We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH. Our results reveal a substantial performance gap between high-resource and Formosan languages.
arXiv Detail & Related papers (2025-06-12T07:02:28Z)
Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects [0.0]
We aim to pioneer NLP technologies for Comorian, a group of four languages or dialects belonging to the Bantu family. Our approach is motivated by the hypothesis that if a human can understand a different language from their native language with little or no effort, it would be entirely possible to model this process on a machine.
arXiv Detail & Related papers (2024-12-09T22:47:41Z)
Tagengo: A Multilingual Chat Dataset [3.8073142980733]
We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages. We use this dataset to train a state-of-the-art open-source English LLM to chat multilingually.
arXiv Detail & Related papers (2024-05-21T09:06:36Z)
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages. We conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z)
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language. We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions. We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any. We discuss recent advances, findings, and open questions arising from the increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages. The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda, and Yorùbá. The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z)
AllWOZ: Towards Multilingual Task-Oriented Dialog Systems for All [41.10368284872525]
This paper presents AllWOZ, a multilingual task-oriented customer service dialog dataset covering eight languages. We create a benchmark for our multilingual dataset by applying mT5 with meta-learning.
arXiv Detail & Related papers (2021-12-15T18:30:51Z)
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.