LM-Cocktail: Resilient Tuning of Language Models via Model Merging
- URL: http://arxiv.org/abs/2311.13534v4
- Date: Fri, 8 Dec 2023 16:45:22 GMT
- Title: LM-Cocktail: Resilient Tuning of Language Models via Model Merging
- Authors: Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing
- Abstract summary: We propose LM-Cocktail, which keeps the fine-tuned model resilient on general tasks beyond its target domain.
Our method takes the form of model merging.
We conduct comprehensive experiments with LLaMA and BGE models on popular benchmarks.
- Score: 8.479219617263498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models are continually fine-tuned to better support
downstream applications. However, this operation may result in significant
performance degradation on general tasks beyond the targeted domain. To
overcome this problem, we propose LM-Cocktail, which enables the fine-tuned
model to stay resilient on general tasks. Our method takes the form of
model merging, where the fine-tuned language model is merged with the
pre-trained base model or with peer models from other domains through a weighted
average. Despite its simplicity, LM-Cocktail is surprisingly effective: the
resulting model achieves strong empirical performance across the whole
scope of general tasks while preserving a superior capacity in its targeted
domain. We conduct comprehensive experiments with LLaMA and BGE models on
popular benchmarks, including FLAN, MMLU, and MTEB, whose results validate the
efficacy of our proposed method. The code and checkpoints are available at
https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.
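Since the method is a weighted average in parameter space, it is easy to sketch. Below is a minimal PyTorch sketch (not the authors' released code; the checkpoint paths and the 0.5/0.3/0.2 coefficients are placeholders, as the paper derives its merging weights from the candidates' losses on a few target-domain examples):

```python
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted average of model parameters (LM-Cocktail-style merging).

    state_dicts: list of state dicts with identical keys and shapes.
    weights: merging coefficients, assumed to sum to 1.
    """
    assert len(state_dicts) == len(weights)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights))
    return merged

# Hypothetical checkpoint paths; merge a fine-tuned model with its base
# model and one peer model from another domain.
fine_tuned = torch.load("fine_tuned.pt")
base = torch.load("base.pt")
peer = torch.load("peer.pt")
merged = merge_state_dicts([fine_tuned, base, peer], [0.5, 0.3, 0.2])
torch.save(merged, "cocktail.pt")
```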
Related papers
- CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference [33.871080938643566]
Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead.
We propose CMoE, a novel framework to efficiently carve MoE models from dense models.
CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation.
arXiv Detail & Related papers (2025-02-06T14:05:30Z)
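The summary above is high level; as a rough illustration of what "carving" experts from a dense model can look like, here is a naive sketch that partitions an FFN's hidden units into expert groups. CMoE's actual grouping and lightweight adaptation are more elaborate; all names here are illustrative:

```python
import torch
import torch.nn as nn

def carve_experts(ffn_in: nn.Linear, ffn_out: nn.Linear, num_experts: int):
    """Naive sketch: split a dense FFN (up-projection + down-projection)
    into experts by partitioning its hidden units into contiguous groups."""
    hidden = ffn_in.out_features
    assert hidden % num_experts == 0
    chunk = hidden // num_experts
    experts = []
    for e in range(num_experts):
        rows = slice(e * chunk, (e + 1) * chunk)
        up = nn.Linear(ffn_in.in_features, chunk)
        down = nn.Linear(chunk, ffn_out.out_features)
        up.weight.data = ffn_in.weight.data[rows].clone()
        up.bias.data = ffn_in.bias.data[rows].clone()
        down.weight.data = ffn_out.weight.data[:, rows].clone()
        # Split the output bias evenly, assuming expert outputs are summed.
        down.bias.data = ffn_out.bias.data.clone() / num_experts
        experts.append(nn.Sequential(up, nn.GELU(), down))  # GELU assumed
    # Lightweight adaptation would briefly train this router (and experts).
    router = nn.Linear(ffn_in.in_features, num_experts)
    return nn.ModuleList(experts), router
```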
- Towards Compatible Fine-tuning for Vision-Language Model Updates [114.25776195225494]
Class-conditioned Context Optimization (ContCoOp) integrates learnable prompts with class embeddings using an attention layer before feeding them into the text encoder.
Our experiments over 15 datasets show that our ContCoOp achieves the highest compatibility over the baseline methods, and exhibits robust out-of-distribution generalization.
arXiv Detail & Related papers (2024-12-30T12:06:27Z)
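A minimal sketch of the ContCoOp idea described above, assuming a single attention layer and standard PyTorch modules (the dimensions and the residual combination are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Learnable prompt vectors are conditioned on class embeddings via
    attention before being passed to the text encoder."""
    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # dim must be divisible by num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        # class_embeds: (num_classes, dim); prompts attend over all classes.
        q = self.prompts.unsqueeze(0)     # (1, num_prompts, dim)
        kv = class_embeds.unsqueeze(0)    # (1, num_classes, dim)
        out, _ = self.attn(q, kv, kv)
        # Residual combination; the conditioned prompts go to the text encoder.
        return (q + out).squeeze(0)
```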
- A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity among the constituent models and trains them collaboratively.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
- Exploring Model Kinship for Merging Large Language Models [52.01652098827454]
We introduce model kinship, the degree of similarity or relatedness between Large Language Models.
We find that model kinship correlates with the performance gains obtained after model merging.
We propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets.
arXiv Detail & Related papers (2024-10-16T14:29:29Z)
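As one plausible instance of a kinship score, the sketch below compares two models' weight deltas relative to a shared base via cosine similarity (the paper studies several similarity metrics; this one is illustrative only):

```python
import torch

def model_kinship(sd_a, sd_b, sd_base):
    """Toy kinship proxy: cosine similarity between two models' weight
    deltas relative to a shared base model's state dict."""
    delta_a, delta_b = [], []
    for key in sd_base:
        delta_a.append((sd_a[key] - sd_base[key]).flatten().float())
        delta_b.append((sd_b[key] - sd_base[key]).flatten().float())
    a, b = torch.cat(delta_a), torch.cat(delta_b)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```

In Top-k Greedy Merging, such a score can be used to rank candidate models and decide which ones to merge next.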
- Mitigating Catastrophic Forgetting in Language Transfer via Model Merging [16.845734486667226]
Branch-and-Merge (BaM) is a new adaptation method based on iteratively merging multiple models.
BaM builds on the insight that such iterative merging yields lower-magnitude but higher-quality weight changes.
We demonstrate in an empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance.
arXiv Detail & Related papers (2024-07-11T17:32:40Z)
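A minimal sketch of the branch-and-merge loop under stated assumptions: plain weight averaging serves as the merge step, and `train_fn` is a hypothetical user-supplied training routine:

```python
import copy
import torch

def branch_and_merge(model, data_splits, train_fn, num_iters=2):
    """Each iteration branches the current model, trains each branch on a
    different data split, then averages the branch weights back into one
    model before the next iteration."""
    for _ in range(num_iters):
        branches = []
        for split in data_splits:
            branch = copy.deepcopy(model)
            train_fn(branch, split)  # fine-tune one branch on its split
            branches.append(branch.state_dict())
        merged = {k: torch.stack([sd[k].float() for sd in branches]).mean(0)
                  for k in branches[0]}
        model.load_state_dict(merged)  # merging closes the iteration
    return model
```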
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
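To make the data format concrete, here is a toy sketch of how a divergent-chain-of-thought (DCoT) training example might be assembled; the paper's actual template may differ:

```python
def build_dcot_example(question: str, chains: list[str], answer: str) -> str:
    """Toy DCoT training example: the target output contains several
    distinct reasoning chains before the final answer, so the model
    learns to compare chains and self-correct."""
    body = "\n\n".join(f"Reasoning chain {i + 1}:\n{c}" for i, c in enumerate(chains))
    return f"Question: {question}\n\n{body}\n\nFinal answer: {answer}"
```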
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned from Zephyr-7B-SFT and Llama-3-8B-Instruct, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Mafin: Enhancing Black-Box Embeddings with Model Augmented Fine-Tuning [13.211063836237468]
We introduce Model-augmented fine-tuning (Mafin), a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model.
Our results demonstrate that Mafin significantly enhances the performance of the black-box embeddings by only requiring the training of a small augmented model.
arXiv Detail & Related papers (2024-02-19T14:33:24Z)
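A rough sketch of the Mafin idea, assuming the frozen black-box embedding is combined additively with a projected trainable embedding (the combination rule and all module names are assumptions):

```python
import torch
import torch.nn as nn

class Mafin(nn.Module):
    """A frozen black-box embedding is augmented by a small trainable
    embedding model; only the small model and projection are trained."""
    def __init__(self, blackbox_dim: int, trainable_encoder: nn.Module, aug_dim: int):
        super().__init__()
        self.trainable = trainable_encoder           # small model, trained
        self.proj = nn.Linear(aug_dim, blackbox_dim)

    def forward(self, blackbox_emb: torch.Tensor, inputs) -> torch.Tensor:
        aug = self.proj(self.trainable(inputs))      # augmentation embedding
        combined = blackbox_emb + aug                # black-box part stays frozen
        return nn.functional.normalize(combined, dim=-1)
```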
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We conduct empirical studies of state-of-the-art universal domain adaptation (UniDA) methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
However, the training data behind individual fine-tuned models are often unavailable, which creates a barrier to fusing knowledge across them to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning [23.726336635748783]
Federated learning aims to collaboratively train a strong global model by accessing users' locally trained models but not their own data.
A crucial step is therefore to aggregate local models into a global model, which has been shown to be challenging when users hold non-i.i.d. data.
We propose a novel aggregation algorithm named FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models.
arXiv Detail & Related papers (2020-09-04T01:18:25Z)
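A minimal sketch of FedBE's core sampling step, assuming a diagonal Gaussian over client weights (the paper also considers other distributions, and the subsequent distillation of the sampled ensemble into a single global model is omitted here):

```python
import torch

def sample_global_models(client_state_dicts, num_samples=5):
    """Fit a per-parameter Gaussian over client model weights and sample
    plausible global models from it for a Bayesian model ensemble."""
    keys = client_state_dicts[0].keys()
    stacked = {k: torch.stack([sd[k].float() for sd in client_state_dicts])
               for k in keys}
    mean = {k: v.mean(0) for k, v in stacked.items()}
    std = {k: v.std(0) + 1e-8 for k, v in stacked.items()}  # avoid zero std
    samples = [{k: torch.normal(mean[k], std[k]) for k in keys}
               for _ in range(num_samples)]
    samples.append(mean)  # the mean model is itself a strong candidate
    return samples
```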
This list is automatically generated from the titles and abstracts of the papers on this site.