Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?
- URL: http://arxiv.org/abs/2402.13917v2
- Date: Fri, 5 Apr 2024 05:26:05 GMT
- Title: Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?
- Authors: Ryandito Diandaru, Lucky Susanto, Zilu Tang, Ayu Purwarianti, Derry Wijaya
- Abstract summary: Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on.
Our study delves into Llama2's translation capabilities.
Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen.
- Score: 4.655168524016426
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on. However, the impact of factors beyond training data size on translation performance remains a topic of debate, especially concerning languages not directly encountered during training. Our study delves into Llama2's translation capabilities. By modeling a linear relationship between linguistic feature distances and machine translation scores, we ask whether there are potentially better central languages for LLMs than English. Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen, which rarely happens for languages it has not seen. Most translation improvements into unseen languages come from scaling up the model size rather than instruction tuning or increasing shot count. Furthermore, our correlation analysis reveals that syntactic similarity is not the only linguistic factor that strongly correlates with machine translation scores. Interestingly, we discovered that under specific circumstances, some languages (e.g., Swedish, Catalan), despite having significantly less training data, exhibit correlation levels comparable to English. These insights challenge the prevailing landscape of LLMs, suggesting that models centered around languages other than English could provide a more efficient foundation for multilingual applications.
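The analysis described in the abstract, fitting a linear relationship between linguistic feature distances and machine translation scores and then comparing correlation strength across candidate central languages, can be illustrated in a few lines. The snippet below is a minimal sketch, not the paper's code: the function name and all distance/BLEU values are hypothetical placeholders, and any feature distance (syntactic, phonological, geographic, etc.) could be substituted.

```python
# Minimal sketch (not the paper's code): regress MT scores on linguistic
# feature distances from one candidate central language and measure how
# strongly the two correlate. All numbers are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr


def centrality_fit(distances, mt_scores):
    """Fit mt_score ~ slope * distance + intercept and compute Pearson's r."""
    slope, intercept = np.polyfit(distances, mt_scores, deg=1)
    r, p_value = pearsonr(distances, mt_scores)
    return slope, intercept, r, p_value


# Hypothetical example: syntactic distances of several target languages from
# one candidate central language, paired with the BLEU obtained when
# translating out of that language.
syntactic_distance = np.array([0.10, 0.25, 0.40, 0.55, 0.70])
bleu = np.array([34.0, 28.5, 21.0, 15.5, 9.0])

slope, intercept, r, p = centrality_fit(syntactic_distance, bleu)
print(f"slope={slope:.2f}  intercept={intercept:.2f}  r={r:.2f}  p={p:.3f}")
```

Running the same fit with each candidate language as the distance origin and comparing the resulting correlation strengths mirrors the comparison the abstract draws between English and languages such as Swedish or Catalan.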
Related papers
- Pruning Multilingual Large Language Models for Multilingual Inference [28.36717615166238]
This study explores how to enhance the zero-shot performance of MLLMs in non-English languages.
We first analyze the behavior of MLLMs when performing translation and reveal that large-magnitude features play a critical role in the translation process.
arXiv Detail & Related papers (2024-09-25T13:15:50Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate a novel, counter-intuitive driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, but it is not conclusive whether language imbalance drives cross-lingual generalisation in that setting.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions [68.01449013641532]
Large language models (LLMs) have shown strong abilities in multilingual translation.
We present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7B, to perform multilingual translation.
arXiv Detail & Related papers (2023-05-24T12:00:24Z)
- How do languages influence each other? Studying cross-lingual data sharing during LM fine-tuning [14.02101305717738]
Multilingual large language models (MLLMs) are jointly trained on data from many different languages.
It remains unclear to what extent, and under which conditions, languages rely on each other's data.
We find that MLLMs rely on data from multiple languages from the early stages of fine-tuning and that this reliance gradually increases as fine-tuning progresses.
arXiv Detail & Related papers (2023-05-22T17:47:41Z)
- Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models [4.731313022026271]
In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT).
Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data, but is also highly efficient in the transfer process.
arXiv Detail & Related papers (2022-09-14T05:20:52Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while remaining competitive with the best single systems from WMT (see the usage sketch below).
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
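The many-to-many model from the last related paper above was released publicly as M2M-100, so direct translation between two non-English languages (here Indonesian to Swedish, one of the languages the main paper highlights) can be tried in a few lines. This is a usage sketch assuming the public facebook/m2m100_418M checkpoint and the Hugging Face Transformers M2M-100 classes; it is not code from either paper.

```python
# Usage sketch: direct non-English-to-non-English translation with M2M-100,
# assuming the publicly released facebook/m2m100_418M checkpoint.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # smallest public M2M-100 checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Indonesian -> Swedish, with no English pivot in between.
tokenizer.src_lang = "id"
source = "Model ini menerjemahkan langsung antar bahasa."  # "This model translates directly between languages."
encoded = tokenizer(source, return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("sv"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

Swapping the source and target language codes is all it takes to probe other non-English directions, the kind of non-English-centric setup the main paper argues may deserve more attention.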
This list is automatically generated from the titles and abstracts of the papers on this site.