LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
- URL: http://arxiv.org/abs/2407.05975v2
- Date: Sat, 12 Oct 2024 03:20:44 GMT
- Title: LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
- Authors: Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan
- Abstract summary: Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks.
But their performance in low-resource languages is hindered by insufficient multilingual data during pre-training.
We conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages.
- Score: 36.52198103816494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and the models (https://huggingface.co/LLaMAX/) are publicly available.
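Since the checkpoints are released on Hugging Face, a quick way to try them is to load one with the `transformers` library and prompt it with a translation instruction. The sketch below is a minimal example under assumptions: the model ID `LLaMAX/LLaMAX2-7B` and the prompt template are illustrative guesses rather than the authors' documented interface; see the repository linked above for the recommended usage.

```python
# Minimal sketch: prompting a LLaMAX checkpoint for translation with
# Hugging Face transformers. The model ID and prompt template below are
# assumptions for illustration, not the authors' documented interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "LLaMAX/LLaMAX2-7B"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def translate(text: str, src_lang: str, tgt_lang: str) -> str:
    # Simple instruction-style prompt; the released models may expect a
    # different template.
    prompt = (
        f"Translate the following sentence from {src_lang} to {tgt_lang}.\n"
        f"{src_lang}: {text}\n{tgt_lang}:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(translate("Wie geht es dir?", "German", "English"))
```

For spBLEU scoring as reported on Flores-101, recent versions of sacreBLEU expose a `flores101` tokenizer option that applies the benchmark's SentencePiece tokenization before computing BLEU.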
Related papers
- FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data [39.54285525397304]
We present FuxiTranyu, an open-source multilingual large language model (LLM).
The base model, FuxiTranyu-8B, features 8 billion parameters and is trained from scratch on meticulously balanced multilingual data.
Experiments on a wide range of multilingual benchmarks demonstrate the competitive performance of FuxiTranyu.
arXiv Detail & Related papers (2024-08-12T16:34:56Z) - LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback [61.23008372927665]
We introduce xLLMs-100, which scales the multilingual capabilities of LLaMA and BLOOM to 100 languages.
We evaluate the multilingual understanding and generation capabilities of xLLMs-100 on five multilingual benchmarks.
arXiv Detail & Related papers (2024-06-03T20:25:12Z) - Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training (a minimal sketch of such a staged mixing schedule appears after this list).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages [47.99695189331567]
We present BigTranslate, which adapts LLaMA, a model covering only 20 languages, and enhances it with multilingual translation capability over more than 100 languages.
BigTranslate is built upon LLaMA-13B and is optimized in three steps. First, we continue training LLaMA with massive Chinese monolingual data. Second, we continue training the model with a large-scale parallel dataset that covers 102 natural languages. Third, we instruction-tune the foundation model with multilingual translation instructions, leading to the BigTranslate model.
arXiv Detail & Related papers (2023-05-29T14:07:52Z) - SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages.
We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages.
Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
arXiv Detail & Related papers (2022-10-20T22:32:29Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
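Returning to the PolyLM entry above: the curriculum described there amounts to a staged data-mixing schedule in which the non-English share of the training mixture grows from 30% to 60%. The sketch below illustrates one such schedule with a linear ramp; the function names, the four-stage example, and the interpolation are assumptions made here for illustration, not PolyLM's actual implementation.

```python
# Illustrative curriculum-style mixing schedule: the non-English share of
# the training mixture rises from 30% in the first stage to 60% in the
# last, as described for PolyLM. The linear ramp and 4-stage example are
# assumptions for illustration only.
def non_english_share(stage: int, num_stages: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Fraction of non-English tokens to sample at a given stage."""
    if num_stages <= 1:
        return end
    t = stage / (num_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    return start + t * (end - start)

def mixture_weights(stage: int, num_stages: int) -> dict:
    """Sampling weights over the two data pools at a given stage."""
    p = non_english_share(stage, num_stages)
    return {"non_english": round(p, 3), "english": round(1.0 - p, 3)}

if __name__ == "__main__":
    for s in range(4):
        print(f"stage {s}: {mixture_weights(s, 4)}")
```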