Related papers: Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

URL: http://arxiv.org/abs/2410.12937v1
Date: Wed, 16 Oct 2024 18:23:50 GMT
Title: Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Authors: Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, Pradeep Dasigi
Abstract summary: Adapting general-purpose language models to new skills is currently an expensive process. We investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model.
Score: 102.16497861225358
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper than retraining the models on updated data mixtures, is often comparably effective. Our experiments also show that parallel training is especially well-suited for enabling safety features in LMs relative to continued finetuning and retraining, as it dramatically improves model compliance with safe prompts while preserving its ability to refuse dangerous or harmful prompts.
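The abstract mentions merging a separately trained skill model into the general model, e.g. using task vectors. The following is a minimal sketch of task-vector arithmetic, not the authors' implementation. It assumes weights are stored as plain dictionaries of float lists, that the general and skill models were both finetuned from the same base checkpoint, and that the scaling factors are chosen by hand; the names task_vector, merge, and the toy values are hypothetical.

```python
from typing import Dict, List

# Hypothetical weight container: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Return the element-wise difference finetuned - base for every parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, vectors: List[Weights], scales: List[float]) -> Weights:
    """Add each scaled task vector to a copy of the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for vec, scale in zip(vectors, scales):
        for name, delta in vec.items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


# Toy example: one parameter tensor with two entries (values are made up).
base = {"w": [1.0, 2.0]}
general = {"w": [1.5, 2.0]}   # base finetuned on the general data mixture
skill = {"w": [1.0, 3.0]}     # base finetuned on the new skill in isolation

merged = merge(
    base,
    [task_vector(general, base), task_vector(skill, base)],
    [1.0, 0.5],  # assumed scaling factors, normally tuned on held-out data
)
print(merged)
```

Because the skill model is trained independently of the general model, only the skill finetuning run and the cheap merge step need to be repeated when a new skill dataset appears.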

Related papers

Resona: Improving Context Copying in Linear Recurrence Models with Retrieval [24.84741364872597]
We introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Experiments on a range of linear recurrent models demonstrate significant performance gains on a variety of synthetic and real-world natural language tasks.
arXiv Detail & Related papers (2025-03-28T23:43:33Z)
Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
A Retention-Centric Framework for Continual Learning with Guaranteed Model Developmental Safety [75.8161094916476]
In real-world applications, learning-enabled systems often undergo iterative model development to address challenging or emerging tasks. Acquiring new capabilities or improving existing ones may inadvertently cause the model to lose good capabilities of the old model, a phenomenon known as catastrophic forgetting. We propose a retention-centric framework with data-dependent constraints, and study how to continually develop a pretrained CLIP model for acquiring new or improving existing capabilities of image classification.
arXiv Detail & Related papers (2024-10-04T22:34:58Z)
Robustness-Congruent Adversarial Training for Secure Machine Learning Model Updates [13.911586916369108]
We show that misclassifications in machine-learning models can affect robustness to adversarial examples. We propose a technique, named robustness-congruent adversarial training, to address this issue. We show that our algorithm and, more generally, learning with non-regression constraints, provides a theoretically-grounded framework to train consistent estimators.
arXiv Detail & Related papers (2024-02-27T10:37:13Z)
Making Pre-trained Language Models Better Continual Few-Shot Relation Extractors [15.417833307088637]
Continual Few-shot Relation Extraction (CFRE) is a practical problem that requires the model to continuously learn novel relations. The primary challenges are catastrophic forgetting and overfitting. This paper harnesses prompt learning to explore the implicit capabilities of pre-trained language models.
arXiv Detail & Related papers (2024-02-24T04:32:44Z)
An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales. We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks [17.879087904904935]
Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model. As systems usually evolve over time, adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks. In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of an already trained model on n tasks into a new one for solving n+1 tasks.
arXiv Detail & Related papers (2023-02-22T00:18:25Z)
Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP). What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective. We show that the models enhanced with our method can achieve performances exceeding or very close to state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z)
BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines [49.75878234192369]
We present WEAVER, a simple yet efficient post-processing method that infuses old knowledge into the new model. We show that applying WEAVER in a sequential manner results in word embedding distributions similar to those obtained by combined training on all data at once.
arXiv Detail & Related papers (2022-02-21T10:34:41Z)
Lifelong Learning of Few-shot Learners across NLP Tasks [45.273018249235705]
We study the challenge of lifelong learning to few-shot learn over a sequence of diverse NLP tasks. We propose a continual meta-learning approach which learns to generate adapter weights from a few examples. We demonstrate our approach preserves model performance over training tasks and leads to positive knowledge transfer when the future tasks are learned.
arXiv Detail & Related papers (2021-04-18T10:41:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.