Related papers: CULL-MT: Compression Using Language and Layer pruning for Machine Translation

URL: http://arxiv.org/abs/2411.06506v1
Date: Sun, 10 Nov 2024 16:05:11 GMT
Title: CULL-MT: Compression Using Language and Layer pruning for Machine Translation
Authors: Pedram Rostami, Mohammad Javad Dousti
Abstract summary: We present CULL-MT, a compression method for machine translation models based on structural layer pruning and selected language directions. We find the NLLB-3.3B model to be robust, allowing 25% of layers to be pruned with only a 0.9 spBLEU drop. However, LLaMA3.1-8B-Instruct is more sensitive, with a 2.0 spBLEU drop after pruning 5 layers.
Score: 2.565964707090901
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multilingual machine translation models often outperform traditional bilingual models by leveraging translation knowledge transfer. Recent advancements have led to these models supporting hundreds of languages and achieving state-of-the-art results across various translation directions. However, as these models grow larger, their inference operations become increasingly costly. In many use cases, there is no need to support such a wide range of language pairs, as translation is typically needed in only a few selected directions. In this paper, we present CULL-MT, a compression method for machine translation models based on structural layer pruning and selected language directions. Our approach identifies and prunes unimportant layers using a greedy strategy, then mitigates the impact by applying knowledge distillation from the original model along with parameter-efficient fine-tuning. We apply CULL-MT to the NLLB-3.3B and LLaMA3.1-8B-Instruct models. In a multi-way translation scenario (Persian, French, and German to English), we find the NLLB-3.3B model to be robust, allowing 25% of layers to be pruned with only a 0.9 spBLEU drop. However, LLaMA3.1-8B-Instruct is more sensitive, with a 2.0 spBLEU drop after pruning 5 layers.
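
The greedy pruning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name greedy_layer_prune, the evaluate callback, the max_drop stopping rule, and the toy importance scores are all assumptions made for illustration. In practice evaluate would score a pruned model (for example with spBLEU on the selected language directions), and the distillation and parameter-efficient fine-tuning stage is not shown.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_layer_prune(
    layers: Sequence[int],
    evaluate: Callable[[Tuple[int, ...]], float],
    max_drop: float,
) -> List[int]:
    """Greedily drop layers until the score falls more than max_drop below baseline.

    evaluate is a hypothetical callback that scores a model restricted to the
    given layer indices (higher is better).
    """
    kept = list(layers)
    baseline = evaluate(tuple(kept))
    while len(kept) > 1:
        # Score every candidate model obtained by removing one remaining layer.
        candidates = [
            (evaluate(tuple(l for l in kept if l != layer)), layer)
            for layer in kept
        ]
        # The least important layer is the one whose removal hurts the least.
        best_score, best_layer = max(candidates)
        if baseline - best_score > max_drop:
            break  # Assumed stopping rule: quality budget exhausted.
        kept.remove(best_layer)
    return kept


def toy_evaluate(kept: Tuple[int, ...]) -> float:
    # Toy stand-in for a translation metric: each layer adds a fixed amount.
    importance = {0: 5.0, 1: 0.2, 2: 0.3, 3: 4.0}
    return sum(importance[l] for l in kept)


kept_layers = greedy_layer_prune(range(4), toy_evaluate, max_drop=1.0)
print(kept_layers)
```

With the toy scores, the two low-importance layers are removed first, and pruning stops once removing any further layer would exceed the allowed drop.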

Related papers

Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data. We develop a multilingual neural machine translation (MNMT) model based on language relatedness. We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models [27.777372498182864]
We propose a novel fine-tuning approach for Generative Large Language Models (LLMs). Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance.
arXiv Detail & Related papers (2023-09-20T22:53:15Z)
Learning Language-Specific Layers for Multilingual Machine Translation [1.997704019887898]
We introduce Language-Specific Transformer Layers (LSLs). LSLs allow us to increase model capacity, while keeping the amount of computation and the number of parameters used in the forward pass constant. We study the best way to place these layers using a neural architecture search inspired approach, and achieve an improvement of 1.3 chrF (1.5 spBLEU) points over not using LSLs on a separate decoder architecture, and 1.9 chrF (2.2 spBLEU) on a shared decoder one.
arXiv Detail & Related papers (2023-05-04T09:18:05Z)
Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation [48.37939354609931]
We propose a novel efficient training recipe, upon which we build an effective detachable model, Lego-MT. Experiments show that Lego-MT with 1.2B parameters brings an average gain of 3.2 spBLEU. The proposed training recipe brings a 28.2$\times$ speedup over the conventional multi-way training method.
arXiv Detail & Related papers (2022-12-20T18:54:08Z)
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model [92.91310997807936]
NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages. We propose a pruning method that enables the removal of up to 80% of experts without further finetuning.
arXiv Detail & Related papers (2022-12-19T19:29:40Z)
Building Multilingual Machine Translation Systems That Serve Arbitrary X-Y Translations [75.73028056136778]
We show how to practically build MNMT systems that serve arbitrary X-Y translation directions. We also examine our proposed approach in an extremely large-scale data setting to accommodate practical deployment scenarios.
arXiv Detail & Related papers (2022-06-30T02:18:15Z)
What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases. We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z)
Examining Scaling and Transfer of Language Model Architectures for Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z)
Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
Adapting Monolingual Models: Data can be Scarce when Language Similarity is High [3.249853429482705]
We investigate the performance of zero-shot transfer learning with as little data as possible. We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties. With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance.
arXiv Detail & Related papers (2021-05-06T17:43:40Z)
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics. We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.