One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
- URL: http://arxiv.org/abs/2508.06163v1
- Date: Fri, 08 Aug 2025 09:33:08 GMT
- Title: One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
- Authors: Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu
- Abstract summary: A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. We introduce TADrop (Tensor-wise Adaptive Drop). Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties.
- Score: 44.5685148449294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a "one-size-fits-all" strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce TADrop (Tensor-wise Adaptive Drop), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model's structure, offering a new baseline for high-performance model merging.
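As a rough illustration of the tensor-wise idea in the abstract, the following Python sketch gives each tensor of a task vector its own drop ratio and then prunes by magnitude. The abstract does not say which distributional statistic TADrop uses, so the Hoyer sparsity measure, the linear mapping from that measure to a ratio, the `min_drop` and `max_drop` bounds, and all function names below are assumptions made for illustration, not the authors' method. Tensors are represented as flat lists of floats to keep the example dependency-free.

```python
import math


def hoyer_sparsity(values):
    """Hoyer sparsity in [0, 1]: 0 for a flat tensor, 1 for a single nonzero entry.

    Assumption: used here only as a stand-in for "distributional properties".
    """
    n = len(values)
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    if n < 2 or l2 == 0.0:
        return 0.0
    score = (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)
    return max(0.0, min(1.0, score))  # clamp away floating-point drift


def adaptive_drop_ratio(values, min_drop=0.5, max_drop=0.95):
    """Hypothetical rule: flat (dense-looking) tensors are pruned harder."""
    return max_drop - (max_drop - min_drop) * hoyer_sparsity(values)


def drop_smallest(values, drop_ratio):
    """Zero out the smallest-magnitude entries of one tensor."""
    n_drop = int(len(values) * drop_ratio)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else v for i, v in enumerate(values)]


def sparsify_task_vector(task_vector, min_drop=0.5, max_drop=0.95):
    """Apply a per-tensor drop ratio instead of one global ratio."""
    return {
        name: drop_smallest(values, adaptive_drop_ratio(values, min_drop, max_drop))
        for name, values in task_vector.items()
    }


if __name__ == "__main__":
    # Toy task vector: one flat tensor and one dominated by a single entry.
    toy = {"flat": [0.1] * 8, "peaked": [1.0] + [0.0] * 7}
    for name, values in sparsify_task_vector(toy).items():
        print(name, values)
```

In this toy setup the flat tensor receives a drop ratio near `max_drop`, while the tensor dominated by one large entry receives a ratio near `min_drop` and keeps that entry. A uniform global ratio would treat both tensors identically.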
Related papers
- Model Merging in the Essential Subspace [78.5390284258307]
Model merging aims to integrate multiple task-specific fine-tuned models into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. We propose ESM (Essential Subspace Merging), a robust framework for effective model merging.
arXiv Detail & Related papers (2026-02-23T00:33:38Z)
- Model Merging via Multi-Teacher Knowledge Distillation [11.543771846135021]
We introduce a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. We frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk.
arXiv Detail & Related papers (2025-12-24T17:10:44Z)
- NAN: A Training-Free Solution to Coefficient Estimation in Model Merging [61.36020737229637]
We show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. We propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies.
arXiv Detail & Related papers (2025-05-22T02:46:08Z)
- Dynamic Fisher-weighted Model Merging via Bayesian Optimization [37.02810891820468]
Existing merging approaches typically involve scaling the parameters model-wise or integrating parameter importance parameter-wise. We unify these strategies into a more general merging framework, and introduce Dynamic Fisher-weighted Merging (DF-Merge). We show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks.
arXiv Detail & Related papers (2025-04-26T18:31:14Z)
- Parameter Efficient Merging for Multimodal Large Language Models with Complementary Parameter Adaptation [17.39117429338763]
We propose CoPA-Merging, a training-free parameter efficient merging method with complementary parameter adaptation. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to verify the outstanding performance and generalizability of our method.
arXiv Detail & Related papers (2025-02-24T13:52:05Z)
- Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [72.10987117380584]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. We find existing methods discard task-specific information that, while causing conflicts, is crucial for performance. Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Parameter Competition Balancing for Model Merging [13.66727853299506]
PCB-Merging is a training-free technique that adjusts the coefficients of each parameter for effective model merging.
PCB-Merging achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models.
arXiv Detail & Related papers (2024-10-03T11:17:58Z)
- TIES-Merging: Resolving Interference When Merging Models [95.59265307318752]
Transfer learning can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency.
Model merging has emerged as a solution to combine multiple task-specific models into a single model without performing additional training.
Existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models.
We propose TIES-Merging, which introduces three novel steps when merging models: resetting parameters that only changed a small amount during fine-tuning, resolving sign conflicts, and merging only the parameters that are in alignment with the final agreed-upon sign.
arXiv Detail & Related papers (2023-06-02T17:31:32Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
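The three steps named in the TIES-Merging entry above (resetting small changes, resolving sign conflicts, and merging only sign-aligned parameters) can be sketched in a few lines of Python. This is a minimal illustration on flat lists of floats, not the reference implementation: the function names, the `keep_ratio` parameter, and its default value are assumptions, and the final step of scaling the merged task vector and adding it back to the pre-trained weights is omitted.

```python
def trim(task_vector, keep_ratio):
    """Step 1: keep only the largest-magnitude entries, reset the rest to zero."""
    n_keep = int(len(task_vector) * keep_ratio)
    order = sorted(
        range(len(task_vector)), key=lambda i: abs(task_vector[i]), reverse=True
    )
    kept = set(order[:n_keep])
    return [v if i in kept else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(task_vectors, keep_ratio=0.2):
    """Merge several task vectors (lists of equal length) into one."""
    trimmed = [trim(tv, keep_ratio) for tv in task_vectors]
    merged = []
    for column in zip(*trimmed):
        # Step 2: elect the sign carrying the larger total magnitude.
        total = sum(column)
        sign = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        # Step 3: average only the entries that agree with the elected sign.
        aligned = [v for v in column if v * sign > 0]
        merged.append(sum(aligned) / len(aligned) if aligned else 0.0)
    return merged


if __name__ == "__main__":
    vectors = [[1.0, -2.0, 0.1, 0.0], [3.0, 1.0, -0.1, 0.0]]
    print(ties_merge(vectors, keep_ratio=0.5))
```

The trimming step here uses a single `keep_ratio` for every task vector, which is the kind of uniform sparsity ratio that the abstract above argues against.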
This list is automatically generated from the titles and abstracts of the papers in this site.