Glot500: Scaling Multilingual Corpora and Language Models to 500
Languages
- URL: http://arxiv.org/abs/2305.12182v2
- Date: Fri, 26 May 2023 11:30:08 GMT
- Title: Glot500: Scaling Multilingual Corpora and Language Models to 500
Languages
- Authors: Ayyoob Imani and Peiqin Lin and Amir Hossein Kargaran and Silvia
Severini and Masoud Jalili Sabet and Nora Kassner and Chunlan Ma and Helmut
Schmid and André F. T. Martins and François Yvon and Hinrich Schütze
- Abstract summary: Glot500-m is a horizontally scaled Large Language Model (LLM) that covers 511 predominantly low-resource languages.
An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages.
We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline.
- Score: 8.298465385153527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The NLP community has mainly focused on scaling Large Language Models (LLMs)
vertically, i.e., making them better for about 100 languages. We instead scale
LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM
that covers 511 predominantly low-resource languages. An important part of this
effort is to collect and clean Glot500-c, a corpus that covers these 511
languages and allows us to train Glot500-m. We evaluate Glot500-m on five
diverse tasks across these languages. We observe large improvements for both
high-resource and low-resource languages compared to an XLM-R baseline. Our
analysis shows that no single factor explains the quality of multilingual LLM
representations. Rather, a combination of factors determines quality including
corpus size, script, "help" from related languages and the total capacity of
the model. Our work addresses an important goal of NLP research: we should not
limit NLP to a small fraction of the world's languages and instead strive to
support as many languages as possible to bring the benefits of NLP technology
to all languages and cultures. Code, data and models are available at
https://github.com/cisnlp/Glot500.
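For readers who want to try the released checkpoint, a minimal masked-language-modeling sketch with Hugging Face transformers is shown below. The model ID cis-lmu/glot500-base is an assumption to verify against the repository linked above; the example sentence is arbitrary.

```python
# Minimal sketch: masked-LM inference with a Glot500-style checkpoint.
# The model ID "cis-lmu/glot500-base" is an assumption; check the linked
# repository for the officially released checkpoints.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "cis-lmu/glot500-base"  # assumed Hugging Face ID for Glot500-m
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Predict the most likely tokens for a masked position.
text = f"Paris is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```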
Related papers
- Goldfish: Monolingual Language Models for 350 Languages [23.365111479090626]
For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously.
We release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages.
arXiv Detail & Related papers (2024-08-19T22:31:21Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions [52.95579788485411]
LINGOLLM is a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training.
We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages.
Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions.
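Since LINGOLLM is described as a training-free prompting approach, a rough sketch of how in-context linguistic resources might be assembled into a translation prompt is given below; the prompt template, the toy dictionary, and the Toki Pona example are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch (assumptions, not LINGOLLM's exact prompt): build a translation
# prompt for an unseen language from a small dictionary and a grammar note.
from typing import Dict

def build_prompt(sentence: str, dictionary: Dict[str, str], grammar_note: str) -> str:
    # Gloss each word of the input that appears in the toy dictionary.
    glosses = [f"{w} = {dictionary[w]}" for w in sentence.split() if w in dictionary]
    return (
        "You are given linguistic resources for a low-resource language.\n"
        f"Grammar notes: {grammar_note}\n"
        "Dictionary entries:\n" + "\n".join(glosses) + "\n"
        f"Translate into English: {sentence}\n"
    )

# Toy usage with made-up data; a real setup would draw entries from actual
# dictionaries and grammar books for the target language.
prompt = build_prompt(
    "mi moku",  # Toki Pona example sentence ("I eat")
    {"mi": "I / me", "moku": "to eat / food"},
    "Word order is subject-predicate; no inflection for tense or number.",
)
print(prompt)
```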
arXiv Detail & Related papers (2024-02-28T03:44:01Z)
- MaLA-500: Massive Language Adaptation of Large Language Models [61.440556436524]
MaLA-500 is a novel large language model designed to cover an extensive range of 534 languages.
Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs.
arXiv Detail & Related papers (2024-01-24T08:57:39Z)
- TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data with their transliterations in a unified script.
We show that the resulting model, Furina, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
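The core operation described here, pulling a sentence and its transliteration in a unified script toward the same representation, can be sketched as a generic InfoNCE-style contrastive loss over pooled sentence embeddings; the pooling, temperature, and in-batch negatives below are assumptions, not TransliCo's exact formulation.

```python
# Sketch of a contrastive objective between sentences and their transliterations
# (generic InfoNCE with in-batch negatives; not TransliCo's exact recipe).
import torch
import torch.nn.functional as F

def contrastive_loss(orig_emb: torch.Tensor, translit_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # orig_emb, translit_emb: (batch, hidden) pooled embeddings of the
    # original-script sentences and their transliterations (e.g., to Latin).
    orig = F.normalize(orig_emb, dim=-1)
    translit = F.normalize(translit_emb, dim=-1)
    logits = orig @ translit.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(orig.size(0))         # matching pairs lie on the diagonal
    # Symmetric loss: each sentence should retrieve its own transliteration and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random vectors standing in for mPLM sentence embeddings.
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```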
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
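One plausible reading of this chain-based construction is sketched below: each language is mapped into the shared space by solving an orthogonal Procrustes problem over anchor word pairs with the previously added, related language. The alignment method and data structures are assumptions; the paper's actual procedure may differ.

```python
# Hedged sketch of chain-based multilingual embedding construction: map each
# language into the shared space via orthogonal Procrustes over anchor word
# pairs with the previous language in the chain (a generic reconstruction,
# not necessarily the paper's exact method).
import numpy as np

def procrustes(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    # Orthogonal map W minimizing ||src @ W - tgt||_F (Schoenemann's solution).
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def build_chain_space(chain, anchors):
    # chain: list of {word: vector} embeddings, ordered from the high-resource
    # source to the target; anchors[i]: (word_in_lang_i, word_in_lang_i+1) pairs.
    shared = dict(chain[0])   # the source language defines the shared space
    prev = chain[0]
    for lang, pairs in zip(chain[1:], anchors):
        src = np.stack([lang[w2] for (w1, w2) in pairs])
        tgt = np.stack([prev[w1] for (w1, w2) in pairs])
        w = procrustes(src, tgt)
        lang = {word: vec @ w for word, vec in lang.items()}
        shared.update(lang)   # the newly mapped language anchors the next hop
        prev = lang
    return shared

# Toy usage with random 4-d vectors for a three-language chain.
rng = np.random.default_rng(0)
l0 = {w: rng.normal(size=4) for w in ["dog", "cat"]}
l1 = {w: rng.normal(size=4) for w in ["hund", "katze"]}
l2 = {w: rng.normal(size=4) for w in ["hond", "kat"]}
space = build_chain_space(
    [l0, l1, l2],
    [[("dog", "hund"), ("cat", "katze")], [("hund", "hond"), ("katze", "kat")]],
)
print(len(space))
```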
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
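GlotLID is released as a fastText classifier, so a minimal identification sketch looks roughly like the following; the Hugging Face repository ID cis-lmu/glotlid and the file name model.bin are assumptions to check against the official release.

```python
# Minimal sketch: language identification with a GlotLID-style fastText model.
# The hub ID "cis-lmu/glotlid" and file name "model.bin" are assumptions;
# verify them against the official GlotLID release.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
lid = fasttext.load_model(model_path)

# Labels follow fastText's "__label__<iso639-3>_<script>" convention.
labels, probs = lid.predict("Dies ist ein kurzer deutscher Satz.", k=3)
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 3))
```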
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
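A rough sketch of this exemplar-assembly idea follows: few-shot translation pairs drawn from several high-resource languages prompt the model to translate a low-resource sentence into English. The template and the exemplars are illustrative only and may differ from the paper's prompts.

```python
# Hedged sketch: assemble few-shot translation exemplars from several
# high-resource languages to prompt an LLM to translate a low-resource
# sentence into English (template and exemplars are illustrative only).
from typing import List, Tuple

def assemble_prompt(exemplars: List[Tuple[str, str, str]], source_sentence: str) -> str:
    # exemplars: (language name, source sentence, English translation) triples.
    lines = []
    for lang, src, eng in exemplars:
        lines.append(f"{lang}: {src}\nEnglish: {eng}\n")
    lines.append(f"Unknown language: {source_sentence}\nEnglish:")
    return "\n".join(lines)

prompt = assemble_prompt(
    [
        ("French", "Le chat dort sur le canapé.", "The cat is sleeping on the sofa."),
        ("Hindi", "मुझे चाय पसंद है।", "I like tea."),
        ("Swahili", "Ninapenda kusoma vitabu.", "I like reading books."),
    ],
    "Mi moku e kili.",  # placeholder low-resource input
)
print(prompt)
```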
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.