Rethinking Layer-wise Model Merging through Chain of Merges
- URL: http://arxiv.org/abs/2508.21421v2
- Date: Wed, 01 Oct 2025 11:54:45 GMT
- Title: Rethinking Layer-wise Model Merging through Chain of Merges
- Authors: Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
- Abstract summary: Chain of Merges (CoM) is a layer-wise merging procedure that sequentially merges weights across layers while updating activation statistics at each step. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
- Score: 21.26982153528304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized models increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques operate at the level of individual layers, thereby overlooking the inter-layer dependencies inherent in deep networks. We show that this simplification leads to distributional mismatches, particularly in methods that rely on intermediate activations, as changes in early layers are not properly propagated to downstream layers during merging. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address this, we propose Chain of Merges (CoM), a layer-wise merging procedure that sequentially merges weights across layers while updating activation statistics at each step. By explicitly accounting for inter-layer interactions, CoM mitigates covariate shift and produces a coherent merged model through a series of conditionally optimal updates. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
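As a rough, hypothetical sketch of the procedure the abstract describes (all function names here are mine, and the uniform parameter average stands in for the paper's actual statistics-based merging rule), a CoM-style loop over nn.Sequential models might look like this:

```python
import copy
import torch
import torch.nn as nn

def merge_layer(layers, activations):
    # Placeholder merging rule: uniform parameter averaging. A
    # statistics-based rule, as the paper uses, would instead consume
    # `activations`, the now up-to-date inputs to this layer.
    merged = copy.deepcopy(layers[0])
    with torch.no_grad():
        for p_out, *p_srcs in zip(merged.parameters(),
                                  *(l.parameters() for l in layers)):
            p_out.copy_(torch.stack(p_srcs).mean(dim=0))
    return merged

def chain_of_merges(models, calib_batches):
    # Merge layer i, then push every calibration batch through the newly
    # merged layer, so that layer i+1 is merged against activations that
    # already reflect all upstream merges (no stale statistics).
    xs = [x.clone() for x in calib_batches]  # one unlabeled batch per model
    merged_layers = []
    for i in range(len(models[0])):
        layer = merge_layer([m[i] for m in models], xs)
        merged_layers.append(layer)
        with torch.no_grad():
            xs = [layer(x) for x in xs]
    return nn.Sequential(*merged_layers)

# Example: merge two small MLPs using random calibration batches.
a = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
b = copy.deepcopy(a)
merged = chain_of_merges([a, b], [torch.randn(32, 8), torch.randn(32, 8)])
```

Re-propagating the calibration activations after every merge is the step that one-shot layer-wise merging omits, and it is what keeps the statistics seen by each subsequent layer consistent with the merges already made.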
Related papers
- SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training [54.8494905524997]
Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. We propose SENTINEL, a verification mechanism for pipeline parallelism (PP) training without duplication. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
arXiv Detail & Related papers (2026-03-03T23:51:10Z)
- Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations [55.047454145941366]
Streaming Merging is an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. ARM is a strategy designed to approximate gradient descent dynamics. ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model.
arXiv Detail & Related papers (2026-02-03T08:15:57Z)
- HyFormer: Revisiting the Roles of Sequence Modeling and Feature Interaction in CTR Prediction [8.97787361529607]
This paper presents HyFormer, a unified hybrid transformer architecture that tightly integrates long-sequence modeling and feature interaction into a single backbone. Experiments on billion-scale industrial datasets demonstrate that HyFormer consistently outperforms strong LONGER and RankMixer baselines.
arXiv Detail & Related papers (2026-01-19T02:55:05Z)
- DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting [14.176801586961286]
Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. We propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with a Multi-Scale Patch Decomposition block (EMPD), a Triad Interaction Block (TIB), and an Adaptive Scale Routing MoE block (ASR-MoE). EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer's decomposed representations.
arXiv Detail & Related papers (2025-08-03T13:11:52Z)
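As a toy illustration of what exponentially scaled patch decomposition looks like (my own sketch under simple assumptions, not the paper's EMPD, whose segmentation is dynamic and learned):

```python
import torch

def multiscale_patches(x, base=4, n_scales=3):
    # x: (batch, time, channels). Produce one view per scale, splitting
    # the sequence into non-overlapping patches of exponentially growing
    # length (base, 2*base, 4*base, ...).
    views = []
    for s in range(n_scales):
        p = base * (2 ** s)                     # patch length at this scale
        t = (x.shape[1] // p) * p               # drop the ragged tail
        views.append(x[:, :t].unfold(1, p, p))  # (batch, n_patches, channels, p)
    return views

views = multiscale_patches(torch.randn(2, 96, 7))
print([v.shape for v in views])  # patches of length 4, 8, and 16
```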
- Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [55.914891182214475]
We introduce neural network reprogrammability as a unifying framework for model adaptation. We present a taxonomy that categorizes such information manipulation approaches across four key dimensions. We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z)
- Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging [75.93960998357812]
Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight-matrix-based methods being the predominant approach. We propose a training-free projection-based continual merging method that processes models sequentially.
arXiv Detail & Related papers (2025-01-16T13:17:24Z)
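A minimal sketch of the sequential idea, assuming same-architecture models and using a plain streaming weight average in place of the paper's projection step (all names here are mine):

```python
import copy
import torch

def fold_in(running, new_model, n_seen):
    # Fold one newly arrived fine-tuned model into the running merge as a
    # streaming average of weights, so earlier models never need to be
    # stored or revisited. (The paper's method adds a projection step,
    # omitted in this sketch.)
    with torch.no_grad():
        for p_run, p_new in zip(running.parameters(), new_model.parameters()):
            p_run.mul_(n_seen / (n_seen + 1)).add_(p_new, alpha=1.0 / (n_seen + 1))
    return running

# Models arrive one at a time; only `merged` is ever kept in memory:
# merged = copy.deepcopy(models[0])
# for k, m in enumerate(models[1:], start=1):
#     merged = fold_in(merged, m, n_seen=k)
```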
- Collective Model Intelligence Requires Compatible Specialization [29.590052023903457]
We show that as models specialize, the similarity in their feature space structure diminishes, hindering their capacity for collective use.
We propose a new direction for achieving collective model intelligence through what we call compatible specialization.
arXiv Detail & Related papers (2024-11-04T15:59:16Z)
- Vanishing Feature: Diagnosing Model Merging and Beyond [1.1510009152620668]
We identify the "vanishing feature" phenomenon, where input-induced features diminish during propagation through a merged model. We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue. We propose the "Preserve-First Merging" (PFM) strategy, which focuses on preserving early-layer features.
arXiv Detail & Related papers (2024-02-05T17:06:26Z)
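A small diagnostic of the kind one might use to observe this effect, assuming nn.Sequential-style models (my own sketch, not the paper's code):

```python
import torch

def feature_norms(model, x):
    # Record the mean L2 norm of the features after each layer; under the
    # vanishing feature effect, norms in a merged model decay markedly
    # faster than in the source models it was built from.
    norms = []
    with torch.no_grad():
        for layer in model:
            x = layer(x)
            norms.append(x.norm(dim=-1).mean().item())
    return norms

# Compare feature_norms(merged_model, x) against feature_norms(source_model, x).
```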
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging), which autonomously learns the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
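A minimal sketch of the task-wise variant, assuming each task vector (fine-tuned weights minus pretrained weights) is supplied as a name-to-tensor dict; the entropy of predictions on unlabeled data serves as the unsupervised surrogate objective, which is how the coefficients can be learned without the original training data:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def adamerging_step(base, task_vectors, lambdas, unlabeled_x, opt):
    # Merged weights: pretrained + sum_k lambda_k * task_vector_k, with one
    # learnable coefficient per task (the layer-wise variant would keep one
    # coefficient per task per layer instead).
    merged = {
        name: p.detach() + sum(l * tv[name] for l, tv in zip(lambdas, task_vectors))
        for name, p in base.named_parameters()
    }
    logits = functional_call(base, merged, (unlabeled_x,))
    # Minimize prediction entropy on unlabeled data: no labels or original
    # training data are required.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    opt.zero_grad()
    entropy.backward()
    opt.step()
    return entropy.item()

# lambdas = torch.full((num_tasks,), 0.3, requires_grad=True)
# opt = torch.optim.Adam([lambdas], lr=1e-3)
```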
- A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
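A minimal sketch of the layer-sharing idea, reusing one attention module across every block (the LSTM component of DIA is omitted, and all names here are mine):

```python
import torch
import torch.nn as nn

class SharedAttentionNet(nn.Module):
    # One attention module reused by every block, instead of a separate
    # self-attention module per layer.
    def __init__(self, dim=64, heads=4, depth=6):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):                      # x: (batch, seq, dim)
        for ffn in self.ffns:
            a, _ = self.shared_attn(x, x, x)   # same weights at every depth
            x = x + a                          # residual connections
            x = x + ffn(x)
        return x

out = SharedAttentionNet()(torch.randn(2, 10, 64))
```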
- Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps, together with multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.