Too Brittle To Touch: Comparing the Stability of Quantization and
Distillation Towards Developing Lightweight Low-Resource MT Models
- URL: http://arxiv.org/abs/2210.15184v1
- Date: Thu, 27 Oct 2022 05:30:13 GMT
- Title: Too Brittle To Touch: Comparing the Stability of Quantization and
Distillation Towards Developing Lightweight Low-Resource MT Models
- Authors: Harshita Diddee, Sandipan Dandapat, Monojit Choudhury, Tanuja Ganu,
Kalika Bali
- Abstract summary: State-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
Knowledge Distillation is one popular technique to develop competitive, lightweight models.
- Score: 12.670354498961492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging shared learning through Massively Multilingual Models,
state-of-the-art machine translation models are often able to adapt to the
paucity of data for low-resource languages. However, this performance comes at
the cost of significantly bloated models which are not practically deployable.
Knowledge Distillation is one popular technique for developing competitive,
lightweight models. In this work, we first evaluate its use to compress MT
models, focusing on languages with extremely limited training data. Through
our analysis across 8 languages, we find that the variance in the performance
of the distilled models, driven by their dependence on priors such as the
amount of synthetic data used for distillation, the student architecture, the
training hyperparameters, and the confidence of the teacher models, makes
distillation a brittle compression mechanism. To mitigate this, we explore the
use of
post-training quantization for the compression of these models. Here, we find
that while distillation provides gains across some low-resource languages,
quantization provides more consistent performance trends for the entire range
of languages, especially the lowest-resource languages in our target set.
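To make the comparison concrete, the two sketches below illustrate, in hedged and generic form, the two compression routes the abstract contrasts. They are not the authors' exact pipeline; the Hugging Face checkpoint name, language pair, and example sentences are illustrative assumptions.

```python
# Sketch 1: sequence-level knowledge distillation for MT. A large teacher
# translates monolingual source text, and a smaller student is then trained on
# the resulting synthetic pairs. The priors the abstract flags -- amount of
# synthetic data, student architecture, training hyperparameters, teacher
# confidence -- all surface as explicit choices here.
import torch
from transformers import MarianMTModel, MarianTokenizer

teacher_name = "Helsinki-NLP/opus-mt-en-hi"   # assumed teacher checkpoint
tok = MarianTokenizer.from_pretrained(teacher_name)
teacher = MarianMTModel.from_pretrained(teacher_name).eval()

monolingual_src = ["The river floods every monsoon."]  # prior: synthetic-data volume
with torch.no_grad():
    batch = tok(monolingual_src, return_tensors="pt", padding=True)
    # prior: decoding settings shape how confident/diverse the teacher output is
    out = teacher.generate(**batch, num_beams=5, max_length=64)
synthetic_tgt = tok.batch_decode(out, skip_special_tokens=True)

# The (source, synthetic target) pairs would then train a small student
# transformer; its depth/width and the training hyperparameters are the
# remaining priors that the paper finds make distilled models unstable.
print(list(zip(monolingual_src, synthetic_tgt)))
```

Post-training quantization, by contrast, needs no synthetic data, student design, or retraining; a minimal dynamic-quantization sketch under the same assumptions:

```python
# Sketch 2: post-training dynamic quantization of the same MT model. Only the
# Linear layers are converted to int8; the model is then usable for CPU
# inference without any further training.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-hi"     # assumed example checkpoint
tok = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tok("The weather is nice today.", return_tensors="pt")
with torch.no_grad():
    out = quantized.generate(**inputs, max_length=64)
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```

The only free choices in the second sketch are which module types to quantize and the integer precision, which is consistent with the abstract's finding that quantization behaves more predictably across the lowest-resource languages.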
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models [2.2871867623460216]
This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on an exclusively low-resourced, small-data language model, AfriBERTa.
Through a battery of experiments, we assess the effects of compression on performance across several metrics beyond accuracy.
arXiv Detail & Related papers (2024-04-06T23:52:53Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- Continual Knowledge Distillation for Neural Machine Translation [74.03622486218597]
Parallel corpora are often not publicly accessible due to data copyright, data privacy, and competitive differentiation concerns.
We propose a method called continual knowledge distillation to take advantage of existing translation models to improve one model of interest.
arXiv Detail & Related papers (2022-12-18T14:41:13Z)
- Intriguing Properties of Compression on Multilingual Models [17.06142742945346]
We propose a framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning.
Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties.
arXiv Detail & Related papers (2022-11-04T20:28:01Z)
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
- What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases.
We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z)
- Collective Wisdom: Improving Low-resource Neural Machine Translation using Adaptive Knowledge Distillation [42.38435539241788]
Scarcity of parallel sentence-pairs poses a significant hurdle for training high-quality Neural Machine Translation (NMT) models in bilingually low-resource scenarios.
We propose an adaptive knowledge distillation approach to dynamically adjust the contribution of the teacher models during the distillation process.
Experiments on transferring from a collection of six language pairs from IWSLT to five low-resource language-pairs from TED Talks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-10-12T04:26:46Z)
- XtremeDistil: Multi-stage Distillation for Massive Multilingual Models [19.393371230300225]
We study knowledge distillation with a focus on multilingual Named Entity Recognition (NER).
We propose a stage-wise optimization scheme leveraging teacher internal representations that is agnostic of teacher architecture.
We show that our approach leads to massive compression of mBERT-like teacher models, by up to 35x in parameters and 51x in batch-inference latency, while retaining 95% of the teacher's F1-score for NER over 41 languages.
arXiv Detail & Related papers (2020-04-12T19:49:27Z)
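A minimal sketch in the spirit of XtremeDistil's stage-wise use of teacher internal representations, not the paper's exact recipe; the layer sizes below are illustrative assumptions. The student first learns to match a chosen teacher hidden layer through a learned projection, before a later stage switches to the downstream NER objective.

```python
# Stage-1 representation matching: project student hidden states into the
# teacher's hidden space and minimise the mismatch. Because only hidden-state
# tensors are compared, the scheme stays agnostic of the teacher architecture.
import torch
import torch.nn as nn

teacher_dim, student_dim = 768, 312          # assumed mBERT-like teacher, small student
proj = nn.Linear(student_dim, teacher_dim)   # learned adapter between the two spaces

def representation_loss(student_hidden: torch.Tensor,
                        teacher_hidden: torch.Tensor) -> torch.Tensor:
    # student_hidden: (batch, seq, student_dim); teacher_hidden: (batch, seq, teacher_dim)
    return nn.functional.mse_loss(proj(student_hidden), teacher_hidden)

# Stage 2 would then train on the task objective (e.g., NER tags), optionally
# distilling the teacher's soft label distributions as well.
```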
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.