Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
- URL: http://arxiv.org/abs/2410.01483v2
- Date: Fri, 15 Aug 2025 11:54:42 GMT
- Title: Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
- Authors: Edan Kinderman, Itay Hubara, Haggai Maron, Daniel Soudry
- Abstract summary: We show that traditional merging methods fail catastrophically in this setup. We introduce "Foldable SuperNet" (FS-Merge) which trains a SuperNet containing the original models. After training, the SuperNet is folded back to the size of a single original model.
- Score: 31.962161747846114
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained network, we target the harder problem of merging large transformers trained on different tasks from distinct initializations. We show that traditional merging methods fail catastrophically in this setup, while Knowledge Distillation (KD) achieves much better results, though at a higher cost. However, KD is data-inefficient, as it does not exploit the original models' weights. To solve this, we introduce "Foldable SuperNet Merge" (FS-Merge), which trains a SuperNet containing the original models (with frozen weights) using a feature reconstruction objective. After training, the SuperNet is folded back to the size of a single original model. FS-Merge is simple, data-efficient, has a computational cost comparable to KD, and is proven to have superior expressiveness compared to traditional merging methods on MLP models. It achieves SOTA results when tested on MLPs and transformers across various sizes, tasks, modalities, and distribution shifts, especially in low-data scenarios.
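The train-then-fold idea in the abstract can be illustrated on a single pair of linear layers. The following is a minimal sketch under stated assumptions, not the paper's implementation: this page does not describe the SuperNet's internal structure, so the `merge` and `unmerge` matrices and the function name `fold_linear_layers` are hypothetical, and a closed-form uncentered PCA (via SVD) stands in for gradient-based training of the feature reconstruction objective. The two original weight matrices stay frozen throughout.

```python
import numpy as np

def fold_linear_layers(w_a, w_b, x):
    """Fold two frozen linear layers into one layer of the original width.

    w_a, w_b: (d_out, d_in) weights of the two original layers (never modified).
    x:        (n, d_in) unlabeled inputs used for feature reconstruction, n >= d_out.
    Returns (w_merged, unmerge, rel_err).
    """
    d_out = w_a.shape[0]
    # Hypothetical "SuperNet" layer: both original layers stacked side by side.
    w_super = np.concatenate([w_a, w_b], axis=0)      # (2*d_out, d_in)
    feats = x @ w_super.T                             # (n, 2*d_out)
    # Assumption: use the best rank-d_out linear reconstruction of the
    # concatenated features (uncentered PCA) instead of learned matrices.
    _, _, vt = np.linalg.svd(feats, full_matrices=False)
    merge = vt[:d_out]                                # (d_out, 2*d_out)
    unmerge = merge.T                                 # (2*d_out, d_out)
    # Fold: absorb the merge matrix into the weights, back to (d_out, d_in).
    w_merged = merge @ w_super
    # Relative error of reconstructing both models' features from the fold.
    recon = (x @ w_merged.T) @ unmerge.T
    rel_err = np.linalg.norm(recon - feats) / np.linalg.norm(feats)
    return w_merged, unmerge, rel_err

rng = np.random.default_rng(0)
w_a = rng.standard_normal((8, 4))
w_b = rng.standard_normal((8, 4))
x = rng.standard_normal((256, 4))
w_merged, unmerge, rel_err = fold_linear_layers(w_a, w_b, x)
print(w_merged.shape, rel_err)
```

In this toy case the input width (4) is smaller than the layer width (8), so the concatenated features have rank at most 4 and the folded layer reconstructs both models' features almost exactly. With wider inputs or nonlinear layers the reconstruction is only approximate.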
Related papers
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance.
LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z) - FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as "learngenes".
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
arXiv Detail & Related papers (2024-09-28T08:57:17Z) - Merging Vision Transformers from Different Tasks and Domains [46.40701388197936]
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model merging works focus on either CNNs or NLP models, leaving ViT merging unexplored.
arXiv Detail & Related papers (2023-12-25T09:32:28Z) - Efficient Stitchable Task Adaptation [47.94819192325723]
We present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models.
Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches.
We streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy.
arXiv Detail & Related papers (2023-11-29T04:31:35Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - Factorized Tensor Networks for Multi-Task and Multi-Domain Learning [17.618186852259015]
We propose a factorized tensor network (FTN) that can achieve accuracy comparable to independent single-task/domain networks.
FTN requires a significantly smaller number of task-specific parameters compared to existing methods.
We show the experiments on convolutional-based architecture with different backbones and on transformer-based architecture.
arXiv Detail & Related papers (2023-10-09T19:59:59Z) - Parameter Efficient Multi-task Model Fusion with Partial Linearization [97.23530944186078]
We propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques.
Our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters.
We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model.
arXiv Detail & Related papers (2023-10-07T08:55:54Z) - AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z) - TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression
For On-device ASR Models [30.758876520227666]
TODM is a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with comparable GPU-hours to that of a single training job.
We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet.
Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, by up to a relative 3% in word error rate (WER).
arXiv Detail & Related papers (2023-09-05T04:47:55Z) - Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery
Tickets from Large Models [106.19385911520652]
The Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune large pre-trained models, generating subnetworks.
LTH is heavily inhibited by the repetitive full training and pruning routine of iterative magnitude pruning (IMP).
We propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks.
arXiv Detail & Related papers (2023-06-18T03:09:52Z) - Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts [55.470959564665705]
Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks.
The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models.
It excels in NAS for building memory-efficient task-agnostic BERT models.
arXiv Detail & Related papers (2023-06-08T00:35:36Z) - Stitchable Neural Networks [40.8842135978138]
We present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment.
SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another.
Experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks.
arXiv Detail & Related papers (2023-02-13T18:37:37Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet.
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - SuperShaper: Task-Agnostic Super Pre-training of BERT Models with Variable Hidden Dimensions [2.8583189395674653]
SuperShaper is a task-agnostic pre-training approach for NLU models.
It simultaneously pre-trains a large number of Transformer models by varying shapes.
SuperShaper discovers networks that effectively trade off accuracy and model size.
arXiv Detail & Related papers (2021-10-10T05:44:02Z) - What's Hidden in a One-layer Randomly Weighted Transformer? [100.98342094831334]
Hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance.
Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14.
arXiv Detail & Related papers (2021-09-08T21:22:52Z) - Transfer Learning for Sequence Generation: from Single-source to Multi-source [50.34044254589968]
We propose a two-stage finetuning method to alleviate the pretrain-finetune discrepancy and introduce a novel multi-source sequence generation (MSG) model with a fine encoder to learn better representations in MSG tasks.
Our approach achieves new state-of-the-art results on the WMT17 APE task and multi-source translation task using the WMT14 test set.
arXiv Detail & Related papers (2021-05-31T09:12:38Z) - MutualNet: Adaptive ConvNet via Mutual Learning from Different Model Configurations [51.85020143716815]
We propose MutualNet to train a single network that can run at a diverse set of resource constraints.
Our method trains a cohort of model configurations with various network widths and input resolutions.
MutualNet is a general training methodology that can be applied to various network structures.
arXiv Detail & Related papers (2021-05-14T22:30:13Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)