What Drives Performance in Multilingual Language Models?
- URL: http://arxiv.org/abs/2404.19159v1
- Date: Mon, 29 Apr 2024 23:49:19 GMT
- Title: What Drives Performance in Multilingual Language Models?
- Authors: Sina Bagheri Nezhad, Ameeta Agrawal
- Abstract summary: This study investigates the factors influencing the performance of multilingual large language models (MLLMs) across diverse languages.
We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset.
- Score: 1.7648680700685022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates the factors influencing the performance of multilingual large language models (MLLMs) across diverse languages. We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset encompassing 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of factors such as pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. For UNSEEN languages, however, script type and language family are crucial, highlighting the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter the most important features identified. Our findings provide valuable insights into the strengths and limitations of current MLLMs, and we hope they will guide the development of more effective and equitable multilingual NLP systems.
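The decision tree analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it only scores each feature by the best single split a regression tree could make at its root (largest reduction in squared error), which is a simplified stand-in for full tree-based feature importance. The rows, feature names (`pretrain_tokens_log10`, `script`, `family`), and accuracy values are hypothetical toy data, not results from the paper.

```python
from statistics import fmean

# Hypothetical toy data (NOT from the paper): one row per language.
rows = [
    {"pretrain_tokens_log10": 9.5, "script": "Latin", "family": "Indo-European", "accuracy": 0.90},
    {"pretrain_tokens_log10": 9.0, "script": "Cyrillic", "family": "Indo-European", "accuracy": 0.88},
    {"pretrain_tokens_log10": 8.5, "script": "Latin", "family": "Austronesian", "accuracy": 0.85},
    {"pretrain_tokens_log10": 6.0, "script": "Latin", "family": "Niger-Congo", "accuracy": 0.60},
    {"pretrain_tokens_log10": 5.5, "script": "Ethiopic", "family": "Afro-Asiatic", "accuracy": 0.55},
    {"pretrain_tokens_log10": 5.0, "script": "Cyrillic", "family": "Turkic", "accuracy": 0.58},
]


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_gain(rows, feature, target="accuracy"):
    """Largest squared-error reduction achievable by one binary split on `feature`."""
    total = sse([r[target] for r in rows])
    best = 0.0
    for value in {r[feature] for r in rows}:
        if isinstance(value, (int, float)):
            # Numeric feature: threshold split (<= value vs. > value).
            left = [r[target] for r in rows if r[feature] <= value]
            right = [r[target] for r in rows if r[feature] > value]
        else:
            # Categorical feature: one category vs. all the others.
            left = [r[target] for r in rows if r[feature] == value]
            right = [r[target] for r in rows if r[feature] != value]
        if left and right:
            best = max(best, total - sse(left) - sse(right))
    return best


def rank_features(rows, features, target="accuracy"):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {f: best_split_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


ranking = rank_features(rows, ["pretrain_tokens_log10", "script", "family"])
for feature, gain in ranking:
    print(f"{feature}: {gain:.4f}")
```

On this toy data the pretraining-size feature ranks first only because the rows were constructed that way. A real analysis would fit a full tree on per-language results and would likely run separately for the SEEN and UNSEEN scenarios.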
Related papers
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- The Impact of Model Scaling on Seen and Unseen Language Performance [2.012425476229879]
We study the performance and scaling behavior of multilingual Large Language Models across 204 languages.
Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios.
In two-shot settings, larger models show clear linear improvements in multilingual text classification.
arXiv Detail & Related papers (2025-01-10T00:10:21Z)
- Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models [1.5703073293718952]
The study identifies token similarity and country similarity as pivotal factors, alongside pre-training data and model size, in enhancing model performance.
These insights offer valuable guidance for developing more equitable and effective multilingual language models.
arXiv Detail & Related papers (2024-12-17T03:05:26Z)
- Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models [22.859955360764275]
We introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test to assess a model's ability to retrieve relevant information.
We evaluate four state-of-the-art large language models on MLNeedle.
arXiv Detail & Related papers (2024-08-19T17:02:06Z)
- Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, we also identify other factors such as general resource availability, language family, and script type, as important features.
arXiv Detail & Related papers (2023-10-09T04:48:14Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- How do languages influence each other? Studying cross-lingual data sharing during LM fine-tuning [14.02101305717738]
Multilingual large language models (MLLMs) are jointly trained on data from many different languages.
It remains unclear to what extent, and under which conditions, languages rely on each other's data.
We find that MLLMs rely on data from multiple languages from the early stages of fine-tuning and that this reliance gradually increases as fine-tuning progresses.
arXiv Detail & Related papers (2023-05-22T17:47:41Z)