Merging Models with Fisher-Weighted Averaging
- URL: http://arxiv.org/abs/2111.09832v1
- Date: Thu, 18 Nov 2021 17:59:35 GMT
- Title: Merging Models with Fisher-Weighted Averaging
- Authors: Michael Matena and Colin Raffel
- Abstract summary: We introduce a fundamentally different method for transferring knowledge across models that amounts to "merging" multiple models into one.
Our approach effectively involves computing a weighted average of the models' parameters.
We show that our merging procedure makes it possible to combine models in previously unexplored ways.
- Score: 24.698591753644077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transfer learning provides a way of leveraging knowledge from one task when
learning another task. Performing transfer learning typically involves
iteratively updating a model's parameters through gradient descent on a
training dataset. In this paper, we introduce a fundamentally different method
for transferring knowledge across models that amounts to "merging" multiple
models into one. Our approach effectively involves computing a weighted average
of the models' parameters. We show that this averaging is equivalent to
approximately sampling from the posteriors of the model weights. While using an
isotropic Gaussian approximation works well in some cases, we also demonstrate
benefits by approximating the precision matrix via the Fisher information. In
sum, our approach makes it possible to combine the "knowledge" in multiple
models at an extremely low computational cost compared to standard
gradient-based training. We demonstrate that model merging achieves comparable
performance to gradient descent-based transfer learning on intermediate-task
training and domain adaptation problems. We also show that our merging
procedure makes it possible to combine models in previously unexplored ways. To
measure the robustness of our approach, we perform an extensive ablation on the
design of our algorithm.
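A minimal NumPy sketch of the diagonal Fisher-weighted average described above; the function names, the toy data, and the use of mean squared per-example gradients as the Fisher estimate are illustrative assumptions rather than the paper's exact implementation.

    import numpy as np

    def diagonal_fisher(per_example_grads):
        # Empirical diagonal Fisher estimate: average squared per-example gradient
        # (a simplifying assumption for this sketch).
        return np.mean(np.square(per_example_grads), axis=0)

    def fisher_merge(thetas, fishers, lambdas=None, eps=1e-8):
        # Element-wise Fisher-weighted average:
        #   theta* = sum_k lambda_k * F_k * theta_k / sum_k lambda_k * F_k
        if lambdas is None:
            lambdas = [1.0] * len(thetas)
        numerator = sum(l * F * t for l, F, t in zip(lambdas, fishers, thetas))
        denominator = sum(l * F for l, F in zip(lambdas, fishers)) + eps
        return numerator / denominator

    # Toy usage with two "models" of five parameters each
    theta_a, theta_b = np.random.randn(5), np.random.randn(5)
    fisher_a = diagonal_fisher(np.random.randn(32, 5))
    fisher_b = diagonal_fisher(np.random.randn(32, 5))
    merged = fisher_merge([theta_a, theta_b], [fisher_a, fisher_b])

Setting each Fisher to all-ones recovers a plain (isotropic) parameter average, which is the simpler baseline the abstract mentions.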
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
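A minimal sketch of the logits-level idea, assuming a Hugging Face-style model that exposes .logits; the value-network architecture and the additive combination are illustrative assumptions rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class LogitValueNetwork(nn.Module):
        # Hypothetical small network that predicts a post-training logit offset.
        def __init__(self, vocab_size, hidden_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vocab_size, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, vocab_size),
            )

        def forward(self, logits):
            return self.net(logits)

    @torch.no_grad()
    def post_trained_logits(base_model, value_net, input_ids):
        # The frozen pre-trained model supplies base logits; the value network adds
        # the learned post-training change, so it can be paired with other base models.
        base = base_model(input_ids).logits
        return base + value_net(base)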
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- Fisher Mask Nodes for Language Model Merging [0.0]
We introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning.
Our method exhibits a consistent and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost.
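The summary does not spell out the exact weighting scheme; one plausible reading, sketched below, is to compute Fisher scores only for a small set of mask nodes (e.g., attention heads) and reuse each node's score as a coarse merging weight for the parameters it owns. The names and the grouping of parameters into nodes are illustrative assumptions.

    import numpy as np

    def node_fisher_scores(squared_grads_by_node):
        # One scalar importance score per mask node (e.g., per attention head),
        # estimated from squared gradients of that node only.
        return {node: float(np.mean(g)) for node, g in squared_grads_by_node.items()}

    def merge_with_node_scores(params_list, scores_list, node_of_param):
        # Weight each model's copy of a parameter by the Fisher score of the node
        # that owns it -- far cheaper than a per-parameter Fisher, but coarser.
        merged = {}
        for name in params_list[0]:
            node = node_of_param[name]
            w = np.array([scores[node] for scores in scores_list])
            w = w / w.sum()
            merged[name] = sum(wk * params[name] for wk, params in zip(w, params_list))
        return merged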
arXiv Detail & Related papers (2024-03-14T21:52:26Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
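A sketch of layer-wise coefficient merging over task vectors, assuming the task-arithmetic setup referenced above; how the coefficients are optimized (the paper learns them without the original training data) is left abstract here, and the helper names are illustrative.

    import torch

    def merge_layerwise(pretrained, finetuned_list, coeffs):
        # coeffs: trainable tensor of shape (num_tasks, num_parameter_tensors),
        # i.e., one coefficient per task and per tensor (treated as a "layer" here),
        # learned with an unsupervised objective rather than the original training data.
        merged = {}
        for layer_idx, (name, theta0) in enumerate(pretrained.items()):
            task_vectors = [ft[name] - theta0 for ft in finetuned_list]
            delta = sum(coeffs[k, layer_idx] * tv for k, tv in enumerate(task_vectors))
            merged[name] = theta0 + delta
        return merged

Task-wise merging is the special case where every layer of a given task shares the same coefficient.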
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Deep Unfolding-based Weighted Averaging for Federated Learning in Heterogeneous Environments [11.023081396326507]
Federated learning is a collaborative model training method that alternates between model updates by multiple clients and aggregation of those updates by a central server.
To adjust the aggregation weights, this paper employs deep unfolding, a well-known parameter-tuning technique.
The proposed method can handle large-scale learning models with the aid of pretrained models, enabling it to perform practical real-world tasks.
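A minimal sketch of making the aggregation weights learnable per round in a deep-unfolding style; the module layout, the softmax normalization, and the training signal for the weights are illustrative assumptions.

    import torch
    import torch.nn as nn

    class UnfoldedAggregator(nn.Module):
        # One learnable weight per (round, client); training would unroll the federated
        # rounds and backpropagate a validation loss into these weights.
        def __init__(self, num_rounds, num_clients):
            super().__init__()
            self.weight_logits = nn.Parameter(torch.zeros(num_rounds, num_clients))

        def forward(self, round_idx, client_models):
            # client_models: list of flattened client parameter tensors
            w = torch.softmax(self.weight_logits[round_idx], dim=0)
            return sum(w_k * theta_k for w_k, theta_k in zip(w, client_models))

Uniform logits recover standard unweighted averaging, which is the baseline this weighting is meant to improve in heterogeneous settings.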
arXiv Detail & Related papers (2022-12-23T08:20:37Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
Fine-tuned models are often available while their training data is not, which creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
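The summary does not give the merging rule; one closed-form possibility in this dataless, parameter-space spirit is a regression-mean style merge of linear layers using per-model Gram matrices that each owner computes once, so no training data changes hands at merge time. Treat the formula and names below as an assumption, not the paper's stated method.

    import numpy as np

    def dataless_linear_merge(weight_list, gram_list, ridge=1e-6):
        # Merge linear-layer weights W_i (d x out) using precomputed Gram matrices
        # G_i = X_i^T X_i (d x d), shared instead of the raw training data:
        #   W* = (sum_i G_i)^{-1} (sum_i G_i W_i)
        d = weight_list[0].shape[0]
        numerator = sum(G @ W for G, W in zip(gram_list, weight_list))
        denominator = sum(gram_list) + ridge * np.eye(d)
        return np.linalg.solve(denominator, numerator)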
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Transfer Learning with Gaussian Processes for Bayesian Optimization [9.933956770453438]
We provide a unified view on hierarchical GP models for transfer learning, which allows us to analyze the relationship between methods.
We develop a novel closed-form boosted GP transfer model that fits between existing approaches in terms of complexity.
We evaluate the performance of the different approaches in large-scale experiments and highlight strengths and weaknesses of the different transfer-learning methods.
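A compact residual ("boosted") GP transfer sketch with scikit-learn, assuming a source GP whose predictions a target GP corrects; the kernels, synthetic data, and the specific residual construction are illustrative, not the paper's exact model.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X_src = rng.uniform(0, 1, (40, 1)); y_src = np.sin(6 * X_src).ravel()
    X_tgt = rng.uniform(0, 1, (8, 1));  y_tgt = np.sin(6 * X_tgt).ravel() + 0.3 * X_tgt.ravel()

    # Fit a GP on the plentiful source task, then a second GP on the target residuals.
    source_gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X_src, y_src)
    residual_gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(
        X_tgt, y_tgt - source_gp.predict(X_tgt))

    def predict_target(X):
        # Transferred prediction: source knowledge plus the learned target correction.
        return source_gp.predict(X) + residual_gp.predict(X)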
arXiv Detail & Related papers (2021-11-22T14:09:45Z)
- Distilling Interpretable Models into Human-Readable Code [71.11328360614479]
Human-readability is an important and desirable standard for machine-learned model interpretability.
We propose to train interpretable models using conventional methods, and then distill them into concise, human-readable code.
We describe a piecewise-linear curve-fitting algorithm that produces high-quality results efficiently and reliably across a broad range of use cases.
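A toy sketch of the curve-fitting step: fit a piecewise-linear function on fixed knots by least squares, then print it as short, human-readable code. The knot placement and the emitted form are illustrative assumptions, not the paper's algorithm.

    import numpy as np

    def fit_piecewise_linear(x, y, num_knots=6):
        # Least-squares fit over hat-function features; returns knot positions and values.
        knots = np.linspace(x.min(), x.max(), num_knots)
        basis = np.stack([np.interp(x, knots, np.eye(num_knots)[j])
                          for j in range(num_knots)], axis=1)
        values, *_ = np.linalg.lstsq(basis, y, rcond=None)
        return knots, values

    def to_readable_code(knots, values, name="feature_score"):
        # Emit the fitted curve as an interpolation table a human can read and audit.
        table = ", ".join(f"({k:.4g}, {v:.4g})" for k, v in zip(knots, values))
        return (f"def {name}(x):\n"
                f"    knots, values = zip(*[{table}])\n"
                f"    return np.interp(x, knots, values)\n")

    x = np.linspace(0, 10, 200)
    print(to_readable_code(*fit_piecewise_linear(x, np.log1p(x))))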
arXiv Detail & Related papers (2021-01-21T01:46:36Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
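A minimal pathwise-gradient sketch of the idea: roll a policy through a differentiable learned dynamics and reward model and backpropagate the return into the policy, instead of treating the model as a black-box simulator. Network sizes, the horizon, and the omission of a terminal critic are simplifications.

    import torch
    import torch.nn as nn

    state_dim, action_dim, horizon, gamma = 8, 2, 5, 0.99
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
    dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
    reward = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    def model_rollout_loss(initial_states):
        # Gradients flow through the learned dynamics at every step of the rollout.
        state, total_return = initial_states, 0.0
        for t in range(horizon):
            action = torch.tanh(policy(state))
            state_action = torch.cat([state, action], dim=-1)
            total_return = total_return + (gamma ** t) * reward(state_action).mean()
            state = dynamics(state_action)
        return -total_return  # maximize the model-predicted return

    loss = model_rollout_loss(torch.randn(32, state_dim))
    optimizer.zero_grad(); loss.backward(); optimizer.step()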
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
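A stripped-down single-layer sketch: match model B's neurons to model A's by solving an assignment problem on weight similarity, then average the aligned weights. The hard assignment (via scipy) is a simplification of the optimal-transport alignment, and extending it to deeper networks also requires permuting the next layer's incoming weights.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def align_and_average(weights_a, weights_b):
        # Rows are neurons. Match B's neurons to A's by maximizing row inner products,
        # reorder B accordingly, then average the aligned layers.
        cost = -weights_a @ weights_b.T
        row_idx, col_idx = linear_sum_assignment(cost)
        aligned_b = np.empty_like(weights_b)
        aligned_b[row_idx] = weights_b[col_idx]
        return 0.5 * (weights_a + aligned_b)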
arXiv Detail & Related papers (2019-10-12T22:07:15Z)