Rethinking Weight-Averaged Model-merging
- URL: http://arxiv.org/abs/2411.09263v5
- Date: Tue, 19 Aug 2025 12:07:46 GMT
- Title: Rethinking Weight-Averaged Model-merging
- Authors: Hu Wang, Congbo Ma, Ibrahim Almakky, Ian Reid, Gustavo Carneiro, Mohammad Yaqub,
- Abstract summary: Model merging, particularly through weight averaging, has shown surprising effectiveness in saving computations and improving model performance without any additional training.<n>In this work, we reinterpret weight-averaged model merging through the lens of interpretability and provide empirical insights into the underlying mechanisms that govern its behavior.
- Score: 15.2881959315021
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Model merging, particularly through weight averaging, has shown surprising effectiveness in saving computations and improving model performance without any additional training. However, the interpretability of why and how this technique works remains unclear. In this work, we reinterpret weight-averaged model merging through the lens of interpretability and provide empirical insights into the underlying mechanisms that govern its behavior. We approach the problem from three perspectives: (1) we analyze the learned weight structures and demonstrate that model weights encode structured representations that help explain the compatibility of weight averaging; (2) we compare averaging in weight space and feature space across diverse model architectures (CNNs and ViTs) and datasets, aiming to expose under which circumstances what combination paradigm will work more effectively; (3) we study the effect of parameter scaling on prediction stability, highlighting how weight averaging acts as a form of regularization that contributes to robustness. By framing these analyses in an interpretability context, our work contributes to a more transparent and systematic understanding of model merging for stakeholders interested in the safety and reliability of untrained model combination methods. The code is available at https://github.com/billhhh/Rethink-Merge.
Related papers
- Revisiting Model Interpolation for Efficient Reasoning [27.32667995137936]
We revisit the simplest merging method that interpolates two weights directly.<n>We observe that model follows a three-stage evolutionary paradigm with distinct behaviors on the reasoning trajectory.
arXiv Detail & Related papers (2025-10-13T03:30:01Z) - Learning Compact Representations of LLM Abilities via Item Response Theory [35.74367665390977]
We explore how to learn compact representations of large language models (LLMs)<n>We frame this problem as estimating the probability that a given model will correctly answer a specific query.<n>To learn these parameters jointly, we introduce a Mixture-of-Experts (MoE) network that couples model- and query-level embeddings.
arXiv Detail & Related papers (2025-10-01T12:55:34Z) - RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models [15.62650736139546]
We classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning.<n>We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings.<n>This evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL.
arXiv Detail & Related papers (2025-08-30T21:35:57Z) - LeMoRe: Learn More Details for Lightweight Semantic Segmentation [48.81126061219231]
We introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity.<n>Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies.
arXiv Detail & Related papers (2025-05-29T04:55:10Z) - Multi-Level Collaboration in Model Merging [56.31088116526825]
This paper explores the intrinsic connections between model merging and model ensembling.<n>We find that even when previous restrictions are not met, there is still a way for model merging to attain a near-identical and superior performance similar to that of ensembling.
arXiv Detail & Related papers (2025-03-03T07:45:04Z) - A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z) - Exploring Model Kinship for Merging Large Language Models [52.01652098827454]
We introduce model kinship, the degree of similarity or relatedness between Large Language Models.
We find that there is a certain relationship between model kinship and the performance gains after model merging.
We propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets.
arXiv Detail & Related papers (2024-10-16T14:29:29Z) - What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z) - Weight Scope Alignment: A Frustratingly Easy Method for Model Merging [40.080926444789085]
Non-I.I.D. data poses a huge challenge for averaging-based model fusion.
In this paper, we reveal variations in weight scope under different training conditions, shedding light on its influence on model merging.
Fortunately, the parameters in each layer basically follow the Gaussian distribution, which inspires a novel and simple regularization approach.
arXiv Detail & Related papers (2024-08-22T09:13:27Z) - Learning-based Models for Vulnerability Detection: An Extensive Study [3.1317409221921144]
We extensively and comprehensively investigate two types of state-of-the-art learning-based approaches.
We experimentally demonstrate the priority of sequence-based models and the limited abilities of both graph-based models.
arXiv Detail & Related papers (2024-08-14T13:01:30Z) - OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance [67.37017498784748]
Large-scale 3D parallel training on vision-language instruction-tuning models leads to an imbalanced computation load across different devices.<n>We rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices.<n>Our method's efficacy and generalizability are further validated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) shows outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - Has Your Pretrained Model Improved? A Multi-head Posterior Based
Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z) - Aggregation Weighting of Federated Learning via Generalization Bound
Estimation [65.8630966842025]
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions.
We replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model.
arXiv Detail & Related papers (2023-11-10T08:50:28Z) - Model Merging by Uncertainty-Based Gradient Matching [70.54580972266096]
We propose a new uncertainty-based scheme to improve the performance by reducing the mismatch.
Our new method gives consistent improvements for large language models and vision transformers.
arXiv Detail & Related papers (2023-10-19T15:02:45Z) - Revisiting Implicit Models: Sparsity Trade-offs Capability in
Weight-tied Model for Vision Tasks [4.872984658007499]
Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models.
We revisit the line of implicit models and trace them back to the original weight-tied models.
Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants.
arXiv Detail & Related papers (2023-07-16T11:45:35Z) - The Effect of Balancing Methods on Model Behavior in Imbalanced
Classification Problems [4.370097023410272]
Imbalanced data poses a challenge in classification as model performance is affected by insufficient learning from minority classes.
This study addresses a more challenging aspect of balancing methods - their impact on model behavior.
To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing.
arXiv Detail & Related papers (2023-06-30T22:25:01Z) - Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z) - Representer Point Selection for Explaining Regularized High-dimensional
Models [105.75758452952357]
We introduce a class of sample-based explanations we term high-dimensional representers.
Our workhorse is a novel representer theorem for general regularized high-dimensional models.
We study the empirical performance of our proposed methods on three real-world binary classification datasets and two recommender system datasets.
arXiv Detail & Related papers (2023-05-31T16:23:58Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), a method based on the adversarial attack that provides semantic adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP)
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains under explored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Meta-Ensemble Parameter Learning [35.6391802164328]
In this paper, we study if we can utilize the meta-learning strategy to directly predict the parameters of a single model with comparable performance of an ensemble.
We introduce WeightFormer, a Transformer-based model that can predict student network weights layer by layer in a forward pass.
arXiv Detail & Related papers (2022-10-05T00:47:24Z) - Merging Models with Fisher-Weighted Averaging [24.698591753644077]
We introduce a fundamentally different method for transferring knowledge across models that amounts to "merging" multiple models into one.
Our approach effectively involves computing a weighted average of the models' parameters.
We show that our merging procedure makes it possible to combine models in previously unexplored ways.
arXiv Detail & Related papers (2021-11-18T17:59:35Z) - Distributional Depth-Based Estimation of Object Articulation Models [21.046351215949525]
We propose a method that efficiently learns distributions over articulation model parameters directly from depth images.
Our core contributions include a novel representation for distributions over rigid body transformations.
We introduce a novel deep learning based approach, DUST-net, that performs category-independent articulation model estimation.
arXiv Detail & Related papers (2021-08-12T17:44:51Z) - Physics-Integrated Variational Autoencoders for Robust and Interpretable
Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z) - Structured learning of rigid-body dynamics: A survey and unified view
from a robotics perspective [5.597839822252915]
We study supervised regression models that combine rigid-body mechanics with data-driven modelling techniques.
We provide a unified view on the combination of data-driven regression models, such as neural networks and Gaussian processes, with analytical model priors.
arXiv Detail & Related papers (2020-12-11T11:26:48Z) - Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the ( aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z) - A Semiparametric Approach to Interpretable Machine Learning [9.87381939016363]
Black box models in machine learning have demonstrated excellent predictive performance in complex problems and high-dimensional settings.
Their lack of transparency and interpretability restrict the applicability of such models in critical decision-making processes.
We propose a novel approach to trading off interpretability and performance in prediction models using ideas from semiparametric statistics.
arXiv Detail & Related papers (2020-06-08T16:38:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.