Related papers: Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

URL: http://arxiv.org/abs/2312.12683v2
Date: Thu, 03 Oct 2024 17:27:28 GMT
Title: Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?
Authors: Tannon Kew, Florian Schottmann, Rico Sennrich
Abstract summary: We investigate the minimal amount of multilinguality required to elicit cross-lingual generalisation in English-centric large language models. We find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation.
Score: 40.13166574854085
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The vast majority of today's large language models (LLMs) are English-centric, having been pretrained predominantly on English text. Yet, in order to meet user expectations, models need to be able to respond appropriately in multiple languages once deployed in downstream applications. This requires strong cross-lingual transfer abilities. In this work, we investigate the minimal amount of multilinguality required during finetuning to elicit cross-lingual generalisation in English-centric LLMs. In experiments across four LLMs, we find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation, with the limiting factor being the degree to which a target language is seen during pretraining. Evaluations on five different tasks further reveal that multilingual instruction tuning is most beneficial for generative tasks that assume input/output language agreement, such as in chat settings, while being of less importance for highly structured classification-style tasks. Our code and data are available at https://github.com/ZurichNLP/multilingual-instruction-tuning.

Related papers

AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment [46.881574083116085]
Multilingual large language models (LLMs) possess impressive multilingual understanding and generation capabilities. LLMs' performance and cross-lingual alignment often lag for non-dominant languages. We propose AlignX, a two-stage representation-level framework, to bridge the multilingual performance gap.
arXiv Detail & Related papers (2025-09-29T06:37:46Z)
CoCo-CoLa: Evaluating Language Adherence in Multilingual LLMs [1.2057938662974816]
Large Language Models (LLMs) develop cross-lingual abilities despite being trained on limited parallel data. We introduce CoCo-CoLa, a novel metric to evaluate language adherence in multilingual LLMs.
arXiv Detail & Related papers (2025-02-18T03:03:53Z)
Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs). It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs. It achieves superior results with far fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities; yet, they are mostly English-centric due to imbalanced training corpora. This work extends the evaluation from NLP tasks to real user queries. For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising.
arXiv Detail & Related papers (2024-03-15T12:47:39Z)
How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages. This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks. We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT. Our approach is competitive with applying language adaptive fine-tuning (LAFT) on individual languages while requiring significantly less disk space.
arXiv Detail & Related papers (2022-04-13T16:13:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.