Related papers: DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling

DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling

URL: http://arxiv.org/abs/2406.11617v1
Date: Mon, 17 Jun 2024 15:02:45 GMT
Title: DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling
Authors: Pala Tej Deep, Rishabh Bhardwaj, Soujanya Poria,
Abstract summary: We propose a new model merging technique, Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique, MAGPRUNE. MAGPRUNE first ranks the parameters in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower ranks corresponding to lower magnitudes.
Score: 24.270321913746233
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: With the proliferation of domain-specific models, model merging has emerged as a set of techniques that combine the capabilities of multiple models into one that can multitask without the cost of additional training. In this paper, we propose a new model merging technique, Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique, MAGPRUNE, which shows significant advantages over DARE and TIES. MAGPRUNE first ranks the parameters in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower ranks corresponding to lower magnitudes. To approximate the original embeddings, MAGPRUNE employs a rescaling operation on the parameters that survive the random dropping by 1/(1 - p). On three different expert models considered for merging (LM, Math, Code) and corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), DELLA shows an average improvement of 2.4 points over baseline methods employing delta parameter pruning (an improvement of 3.6 points over TIES, 1.2 points over DARE), and 11.1 points over the no-pruning baseline (TA). We release the source code at: https://github.com/declare-lab/della.

Related papers

Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization [16.420834802431536]
We introduce Frank-Wolfe Merging (FW-Merging) as a constrained optimization problem. FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed Adamerging by 8.39% when merging 20 ViT models. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead.
arXiv Detail & Related papers (2025-03-16T21:07:05Z)
FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion [32.0871035771324]
We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. For target models, we focus on three widely-used smaller variants-Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-72B-Instruct. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding.
arXiv Detail & Related papers (2025-03-06T09:03:36Z)
S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction. SMILE allows for the upscaling of source models into an MoE model without extra data or further training. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs [44.03692512352445]
Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Models (LLMs) quantization. In this paper, we present a novel and effective CLAQ framework by introducing three different types of adaptive strategies for LLM quantization. Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings.
arXiv Detail & Related papers (2024-05-27T14:49:39Z)
EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) shows outstanding performance compared to existing merging methods. EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
Rethinking Few-shot 3D Point Cloud Semantic Segmentation [62.80639841429669]
This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS) We focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution. To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built.
arXiv Detail & Related papers (2024-03-01T15:14:47Z)
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch [72.97553348776425]
We introduce DARE to set most delta parameters without affecting the abilities of Supervised Fine-Tuning (SFT) LMs. Then, we use DARE as a versatile plug-in to sparsify delta parameters of multiple SFT models and merge them into a single model. We also utilize DARE to create a merged LM that ranks first among models with 7 billion parameters on the Open Leaderboard.
arXiv Detail & Related papers (2023-11-06T13:43:07Z)
TIES-Merging: Resolving Interference When Merging Models [95.59265307318752]
Transfer learning can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. Model merging has emerged as a solution to combine multiple task-specific models into a single model without performing additional training. Existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. We propose TIES-Merging, which introduces three novel steps when merging models: resetting parameters that only changed a small amount during fine-tuning, resolving sign conflicts, and merging only the parameters that are in alignment with the final agreed-upon sign.
arXiv Detail & Related papers (2023-06-02T17:31:32Z)
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose A graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning. A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
arXiv Detail & Related papers (2021-10-18T17:35:41Z)
Bayesian multiscale deep generative model for the solution of high-dimensional inverse problems [0.0]
A novel multiscale Bayesian inference approach is introduced based on deep probabilistic generative models. The method allows high-dimensional parameter estimation while exhibiting stability, efficiency and accuracy.
arXiv Detail & Related papers (2021-02-04T11:47:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.