Memory-efficient NLLB-200: Language-specific Expert Pruning of a
Massively Multilingual Machine Translation Model
- URL: http://arxiv.org/abs/2212.09811v3
- Date: Fri, 7 Jul 2023 09:53:20 GMT
- Title: Memory-efficient NLLB-200: Language-specific Expert Pruning of a
Massively Multilingual Machine Translation Model
- Authors: Yeskendir Koishekenov, Alexandre Berard, Vassilina Nikoulina
- Abstract summary: NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages.
We propose a pruning method that enables the removal of up to 80% of experts without further finetuning.
- Score: 92.91310997807936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently released NLLB-200 is a set of multilingual Neural Machine
Translation models that cover 202 languages. The largest model is based on a
Mixture of Experts architecture and achieves SoTA results across many language
pairs. It contains 54.5B parameters and requires at least four 32GB GPUs just
for inference. In this work, we propose a pruning method that enables the
removal of up to 80% of experts without further finetuning and with a
negligible loss in translation quality, which makes it feasible to run the
model on a single 32GB GPU. Further analysis suggests that our pruning metrics
can identify language-specific experts.
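The abstract does not spell out the pruning metric, so the snippet below is only a minimal sketch of the general idea: score each expert in a Mixture-of-Experts layer by how much routing mass it receives on a small sample of the language pair of interest, then keep roughly the top 20% (mirroring the "remove up to 80% of experts" setting, with no finetuning afterwards). The function names, the activity-based score, and the synthetic router outputs are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def score_experts(gate_probs: np.ndarray) -> np.ndarray:
    """Score each expert by how much routing mass it receives.

    gate_probs: array of shape (num_tokens, num_experts) holding the
    router's softmax outputs for a small sample of the language pair
    of interest (hypothetical input; the paper's exact metric may differ).
    """
    return gate_probs.sum(axis=0)

def select_experts(gate_probs: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
    """Return indices of the experts to keep for this language pair.

    Keeping ~20% of experts mirrors the "remove up to 80%" setting in
    the abstract; no finetuning is performed after pruning.
    """
    scores = score_experts(gate_probs)
    num_keep = max(1, int(round(keep_fraction * scores.size)))
    return np.argsort(scores)[::-1][:num_keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake router outputs for 1,000 tokens over 128 experts in one MoE layer.
    fake_gate_probs = rng.dirichlet(np.ones(128), size=1000)
    kept = select_experts(fake_gate_probs, keep_fraction=0.2)
    print(f"Keeping {kept.size} of 128 experts:", kept[:10], "...")
```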
Related papers
- CULL-MT: Compression Using Language and Layer pruning for Machine Translation [2.565964707090901]
We present CULL-MT, a compression method for machine translation models based on structural layer pruning and selected language directions.
We find the NLLB-3.3B model to be robust, allowing 25% of layers to be pruned with only a 0.9 spBLEU drop.
However, LLaMA3.1-8B-Instruct is more sensitive, with a 2.0 spBLEU drop after pruning 5 layers.
arXiv Detail & Related papers (2024-11-10T16:05:11Z)
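As a companion to the CULL-MT entry above, here is a hedged sketch of generic structural layer pruning driven by translation quality on a chosen set of language directions: greedily drop the layer whose removal hurts a quality score (e.g. spBLEU) the least, and stop once the drop from the unpruned baseline exceeds a budget such as the 0.9 spBLEU mentioned above. The `evaluate` callable and the greedy loop are illustrative assumptions, not CULL-MT's actual procedure.

```python
from typing import Callable, Set

def greedy_layer_prune(
    num_layers: int,
    evaluate: Callable[[Set[int]], float],
    max_drop: float,
) -> Set[int]:
    """Greedily drop the layer whose removal hurts quality the least,
    stopping once the score (e.g. spBLEU on the selected directions)
    falls more than `max_drop` below the unpruned baseline.

    `evaluate` takes the set of layers to skip and returns a quality
    score; it is a placeholder for a real evaluation on the chosen
    language directions.
    """
    baseline = evaluate(set())
    pruned: Set[int] = set()
    while True:
        candidates = [l for l in range(num_layers) if l not in pruned]
        best_layer, best_score = None, float("-inf")
        for layer in candidates:
            score = evaluate(pruned | {layer})
            if score > best_score:
                best_layer, best_score = layer, score
        if best_layer is None or baseline - best_score > max_drop:
            return pruned
        pruned.add(best_layer)

if __name__ == "__main__":
    # Toy evaluator: quality drops a fixed amount per pruned layer,
    # with deeper layers mattering slightly more.
    def toy_eval(skipped: Set[int]) -> float:
        return 30.0 - sum(0.1 + 0.02 * l for l in skipped)

    print(greedy_layer_prune(num_layers=24, evaluate=toy_eval, max_drop=0.9))
```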
- How Multilingual Are Large Language Models Fine-Tuned for Translation? [13.612090779277281]
Fine-tuning large language models (LLMs) on parallel text has been shown to outperform dedicated translation systems trained in a supervised fashion on much larger amounts of parallel data.
How does translation fine-tuning impact the MT capabilities of LLMs for zero-shot languages, zero-shot language pairs, and translation tasks that do not involve English?
We find that translation fine-tuning improves translation quality even for zero-shot languages on average, but that the impact is uneven depending on the language pairs involved.
arXiv Detail & Related papers (2024-05-30T22:08:20Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with a context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages [27.675441924635294]
Current systems cannot accurately identify most of the world's 7000 languages.
We first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages.
We propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification.
arXiv Detail & Related papers (2023-05-23T17:15:43Z)
- Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM [8.858671209228536]
We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets.
We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
arXiv Detail & Related papers (2023-03-03T13:23:42Z)
- SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages.
We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages.
Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
arXiv Detail & Related papers (2022-10-20T22:32:29Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource language pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning multilingual pretrained models on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
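The last entry above mentions random online backtranslation for unseen (zero-shot) language pairs. Below is a simplified, hedged sketch of that idea, under the assumption that, for each parallel example, a random third language is chosen and the current model translates the target side into it, producing a synthetic pair for an otherwise unseen direction. The `translate` stub, the data layout, and the function name are placeholders for illustration, not the paper's implementation.

```python
import random
from typing import Callable, List, Tuple

def random_online_backtranslation(
    batch: List[Tuple[str, str, str, str]],  # (src_lang, src_text, tgt_lang, tgt_text)
    languages: List[str],
    translate: Callable[[str, str, str], str],
    rng: random.Random,
) -> List[Tuple[str, str, str, str]]:
    """Create synthetic examples for unseen (zero-shot) directions.

    For each parallel example, pick a random third language and translate
    the target side into it with the current model, yielding a new training
    pair (random_lang -> tgt_lang). This is a simplified sketch of random
    online backtranslation; `translate` stands in for the NMT model.
    """
    synthetic = []
    for src_lang, src_text, tgt_lang, tgt_text in batch:
        pivot = rng.choice([l for l in languages if l not in (src_lang, tgt_lang)])
        pivot_text = translate(tgt_text, tgt_lang, pivot)  # on-the-fly backtranslation
        synthetic.append((pivot, pivot_text, tgt_lang, tgt_text))
    return synthetic

if __name__ == "__main__":
    # Toy "model" that just tags the text with the requested language.
    def toy_translate(text: str, src: str, tgt: str) -> str:
        return f"[{tgt}] {text}"

    batch = [("en", "hello world", "de", "hallo welt")]
    print(random_online_backtranslation(batch, ["en", "de", "fr", "zh"],
                                        toy_translate, random.Random(0)))
```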
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.