If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
- URL: http://arxiv.org/abs/2412.04144v3
- Date: Mon, 03 Feb 2025 20:31:39 GMT
- Title: If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
- Authors: Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
- Abstract summary: We study merging "generalist" models trained on many tasks. Our algorithm tunes the weight of each checkpoint in a linear combination, resulting in an optimal model. Good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
- Score: 48.95875673503714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and the suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in such an optimal model that outperforms both individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
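The abstract describes tuning one weight per checkpoint in a linear combination. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the names `merge_checkpoints`, `random_search_weights`, and `score_fn` are hypothetical, the weights are normalized to sum to 1 (an assumption), and plain random search stands in for whatever optimizer the paper actually uses. Checkpoints are modelled as dictionaries mapping parameter names to flat lists of floats.

```python
import random


def merge_checkpoints(checkpoints, weights):
    """Return the linear combination sum_i weights[i] * checkpoints[i]."""
    if len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][j] for w, ckpt in zip(weights, checkpoints))
            for j in range(len(reference))
        ]
    return merged


def random_search_weights(checkpoints, score_fn, n_trials=200, seed=0):
    """Search for merge weights that maximize score_fn (higher is better).

    Assumption: weights are non-negative and sum to 1. Random search is a
    stand-in optimizer chosen for brevity, not the paper's method.
    """
    rng = random.Random(seed)
    n = len(checkpoints)
    best_weights = [1.0 / n] * n  # start from the uniform average
    best_score = score_fn(merge_checkpoints(checkpoints, best_weights))
    for _ in range(n_trials):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        candidate = [r / total for r in raw]
        score = score_fn(merge_checkpoints(checkpoints, candidate))
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score


# Toy example: two "checkpoints" that each match a made-up target on one axis.
toy_checkpoints = [{"layer": [0.0, 2.0]}, {"layer": [2.0, 0.0]}]
toy_target = [1.5, 0.5]


def toy_score(params):
    # Negative squared distance to the target; a real score would come from
    # running the merged model on held-out tasks.
    return -sum((p - t) ** 2 for p, t in zip(params["layer"], toy_target))


weights, score = random_search_weights(toy_checkpoints, toy_score)
```

In practice `score_fn` would aggregate benchmark results across the capabilities being traded off (for example instruction following and code generation), and each evaluation would be far more expensive than in this toy setting.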
Related papers
- Towards Reversible Model Merging For Low-rank Weights [5.100622189286672]
Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. We show that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. We propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model.
arXiv Detail & Related papers (2025-10-15T23:22:38Z)
- Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation [49.98025799046136]
We introduce Merge-And-GuidE, a two-stage framework that leverages model merging for guided decoding. In Stage 1, MAGE resolves a compatibility problem between the guidance and base models. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1.
arXiv Detail & Related papers (2025-10-04T11:10:07Z)
- Navigating the Accuracy-Size Trade-Off with Flexible Model Merging [15.497612580389479]
We propose FlexMerge, a novel data-free model merging framework. It flexibly generates merged models of varying sizes, spanning the full spectrum from a single merged model to retaining all fine-tuned models. Using FlexMerge, we systematically characterize the accuracy-size trade-off of different algorithms.
arXiv Detail & Related papers (2025-05-29T07:50:32Z)
- Why Do More Experts Fail? A Theoretical Analysis of Model Merging [51.18155031364046]
Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Recent model merging methods have shown promising results, but struggle to maintain performance gains as the number of merged models increases. We show that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged.
arXiv Detail & Related papers (2025-05-27T14:10:46Z)
- NAN: A Training-Free Solution to Coefficient Estimation in Model Merging [61.36020737229637]
We show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. We propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies.
arXiv Detail & Related papers (2025-05-22T02:46:08Z)
- Dynamic Fisher-weighted Model Merging via Bayesian Optimization [37.02810891820468]
Existing merging approaches typically involve scaling the parameters model-wise or integrating parameter importance parameter-wise.
We unify these strategies into a more general merging framework and introduce Dynamic Fisher-weighted Merging (DF-Merge).
We show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks.
arXiv Detail & Related papers (2025-04-26T18:31:14Z)
- Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model.
This paper explores checkpoint merging in the context of parameter-efficient fine-tuning.
We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics.
arXiv Detail & Related papers (2025-04-23T05:11:21Z)
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
- Pareto Merging: Multi-Objective Optimization for Preference-Aware Model Merging [11.186194228460273]
We propose a preference-aware model merging problem in which the performance of the merged model on each base model's task is treated as an objective.
We show that the proposed model merging produces diverse trade-off models and achieves higher test accuracy compared to state-of-the-art merging baselines.
arXiv Detail & Related papers (2024-08-22T03:41:14Z)
- Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis [17.989809995141044]
We propose CCA Merge, which is based on Canonical Correlation Analysis.
We show that CCA Merge works significantly better than past methods when more than 2 models are merged.
arXiv Detail & Related papers (2024-07-07T14:21:04Z)
- PLeaS -- Merging Models with Permutations and Least Squares [43.17620198572947]
We propose a new two-step algorithm to merge models, termed PLeaS.
PLeaS partially matches nodes in each layer by maximizing alignment.
It computes the weights of the merged model as a layer-wise Least Squares solution.
arXiv Detail & Related papers (2024-07-02T17:24:04Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free: it requires neither data nor additional training while still showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Merging by Matching Models in Task Parameter Subspaces [87.8712523378141]
Model merging aims to cheaply combine individual task-specific models into a single multitask model.
We formalize how this approach to model merging can be seen as solving a linear system of equations.
We show that using the conjugate gradient method can outperform closed-form solutions.
arXiv Detail & Related papers (2023-12-07T14:59:15Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)