Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
- URL: http://arxiv.org/abs/2407.08699v2
- Date: Tue, 16 Jul 2024 07:48:06 GMT
- Title: Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
- Authors: Anton Alexandrov, Veselin Raychev, Mark Niklas Müller, Ce Zhang, Martin Vechev, Kristina Toutanova,
- Abstract summary: Branch-and-Merge (BaM) is a new adaptation method based on iteratively merging multiple models.
BaM is based on the insight that this yields lower magnitude but higher quality weight changes.
We demonstrate in an empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance.
- Score: 16.845734486667226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As open-weight large language models (LLMs) achieve ever more impressive performances across a wide range of tasks in English, practitioners aim to adapt these models to different languages. However, such language adaptation is often accompanied by catastrophic forgetting of the base model's capabilities, severely limiting the usefulness of the resulting model. We address this issue by proposing Branch-and-Merge (BaM), a new adaptation method based on iteratively merging multiple models, fine-tuned on a subset of the available training data. BaM is based on the insight that this yields lower magnitude but higher quality weight changes, reducing forgetting of the source domain while maintaining learning on the target domain. We demonstrate in an extensive empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance compared to both standard continued pretraining and instruction finetuning across different model architectures.
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT)
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Neural Machine Translation Models Can Learn to be Few-shot Learners [2.2999148299770042]
We show that a much smaller model can be trained to perform in-context learning (ICL)
With this capacity for ICL, the model can take advantage of relevant few-shot examples to adapt its output towards the domain.
Our approach allows efficient batch inference on a mix of domains and outperforms state-of-the-art baselines in terms of both translation quality and immediate adaptation rate.
arXiv Detail & Related papers (2023-09-15T17:44:21Z) - Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual
Machine Translation [8.7660229706359]
Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation.
We show effective regularization strategies that prevent over-fitting and improve the performance of MoE models on low-resource tasks.
arXiv Detail & Related papers (2022-12-15T01:06:55Z) - EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning [38.928786416891924]
We introduce efficient and effective massively multilingual sentence embedding (EMS) using cross-lingual token-level reconstruction (XTR) and sentence-level contrastive learning as training objectives.
Compared with related studies, the proposed model can be efficiently trained using significantly fewer parallel sentences and GPU computation resources.
We release the codes for model training and the EMS pre-trained sentence embedding model, which supports 62 languages.
arXiv Detail & Related papers (2022-05-31T12:29:25Z) - Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu)
arXiv Detail & Related papers (2022-05-21T06:44:59Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.