Vanishing Feature: Diagnosing Model Merging and Beyond
- URL: http://arxiv.org/abs/2402.05966v3
- Date: Sat, 04 Jan 2025 12:51:11 GMT
- Title: Vanishing Feature: Diagnosing Model Merging and Beyond
- Authors: Xingyu Qu, Samuel Horvath,
- Abstract summary: We identify the vanishing feature'' phenomenon, where input-induced features diminish during propagation through a merged model.
We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue.
We propose the Preserve-First Merging'' (PFM) strategy, which focuses on preserving early-layer features.
- Score: 1.1510009152620668
- License:
- Abstract: Model merging offers an efficient way to combine pre-trained neural networks but often suffers from inconsistent performance, especially when merging models with different initializations. We identify the ``vanishing feature'' phenomenon, where input-induced features diminish during propagation through the merged model, degrading performance. Through theoretical and empirical analysis, we reveal that this phenomenon underpins challenges like variance collapse and explains techniques like permutation-based merging, post-merging normalization, etc. We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue. Leveraging these insights, we propose the ``Preserve-First Merging'' (PFM) strategy, which focuses on preserving early-layer features, enabling the merged models, for the first time, to outperform the original models in advanced settings without post-training. Furthermore, we demonstrate that the vanishing feature phenomenon extends to other contexts, such as model pruning. Applying post-pruning normalization to mitigate the issue significantly improves one-shot pruning performance at high sparsity, offering a simple and effective post-pruning solution. The code is available at https://github.com/XingyuQu/VF.
Related papers
- ConsistentFeature: A Plug-and-Play Component for Neural Network Regularization [0.32885740436059047]
Over- parameterized neural network models often lead to significant performance discrepancies between training and test sets.
We introduce a simple perspective on overfitting: models learn different representations in different i.i.d. datasets.
We propose an adaptive method, ConsistentFeature, that regularizes the model by constraining feature differences across random subsets of the same training set.
arXiv Detail & Related papers (2024-12-02T13:21:31Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals [14.741951369068877]
We find that ubiquitous time series (TS) forecasting models are prone to severe overfitting.
We introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method.
The proposed method outperform existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.
arXiv Detail & Related papers (2024-02-04T03:54:31Z) - JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models.
Previous methods may neglect the importance of a sufficient formulation of task-specific condition strategy.
We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
arXiv Detail & Related papers (2023-12-20T08:05:57Z) - AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging)
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z) - Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging [24.64264715041198]
We introduce Sparse Model Soups (SMS), a novel method for merging sparse models by initiating each prune-retrain cycle with the averaged model from the previous phase.
SMS preserves sparsity, exploits sparse network benefits, is modular and fully parallelizable, and substantially improves IMP's performance.
arXiv Detail & Related papers (2023-06-29T08:49:41Z) - Exploiting Diffusion Prior for Real-World Image Super-Resolution [75.5898357277047]
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution.
By employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model.
arXiv Detail & Related papers (2023-05-11T17:55:25Z) - Latent Autoregressive Source Separation [5.871054749661012]
This paper introduces vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models.
Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens.
arXiv Detail & Related papers (2023-01-09T17:32:00Z) - ClusterQ: Semantic Feature Distribution Alignment for Data-Free
Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z) - Adversarial and Contrastive Variational Autoencoder for Sequential
Recommendation [25.37244686572865]
We propose a novel method called Adversarial and Contrastive Variational Autoencoder (ACVAE) for sequential recommendation.
We first introduce the adversarial training for sequence generation under the Adversarial Variational Bayes framework, which enables our model to generate high-quality latent variables.
Besides, when encoding the sequence, we apply a recurrent and convolutional structure to capture global and local relationships in the sequence.
arXiv Detail & Related papers (2021-03-19T09:01:14Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.