Deep Model Assembling
- URL: http://arxiv.org/abs/2212.04129v1
- Date: Thu, 8 Dec 2022 08:04:06 GMT
- Title: Deep Model Assembling
- Authors: Zanlin Ni, Yulin Wang, Jiangwei Yu, Haojun Jiang, Yue Cao, Gao Huang
- Abstract summary: This paper studies a divide-and-conquer strategy to train large models.
It divides a large model into smaller modules, trains them independently, and reassembles the trained modules to obtain the target model.
We introduce a global, shared meta model to implicitly link all the modules together.
This enables us to train highly compatible modules that collaborate effectively when they are assembled together.
- Score: 31.88606253639418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large deep learning models have achieved remarkable success in many
scenarios. However, training large models is usually challenging, e.g., due to
the high computational cost, the unstable and painfully slow optimization
procedure, and the vulnerability to overfitting. To alleviate these problems,
this work studies a divide-and-conquer strategy, i.e., dividing a large model
into smaller modules, training them independently, and reassembling the trained
modules to obtain the target model. This approach is promising since it avoids
directly training large models from scratch. Nevertheless, implementing this
idea is non-trivial, as it is difficult to ensure the compatibility of the
independently trained modules. In this paper, we present an elegant solution to
address this issue, i.e., we introduce a global, shared meta model to
implicitly link all the modules together. This enables us to train highly
compatible modules that collaborate effectively when they are assembled
together. We further propose a module incubation mechanism that enables the
meta model to be designed as an extremely shallow network. As a result, the
additional overhead introduced by the meta model is minimized. Though
conceptually simple, our method significantly outperforms end-to-end (E2E)
training in terms of both final accuracy and training efficiency. For example,
on top of ViT-Huge, it improves the accuracy by 2.7% compared to the E2E
baseline on ImageNet-1K, while reducing the training cost by 43%.
Code is available at https://github.com/LeapLabTHU/Model-Assembling.
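To make the strategy concrete, below is a minimal, hypothetical PyTorch-style sketch of the divide-and-conquer procedure described in the abstract: splitting the model into modules, linking them through a shared shallow meta model, and assembling the trained modules. All names, sizes, and training details here are illustrative assumptions rather than the authors' implementation; refer to the linked repository for the actual code.

```python
# Hypothetical sketch of divide-and-conquer training with a shared, shallow meta model.
# Illustrative only; not the authors' implementation (see the official repository).
import torch
import torch.nn as nn

D, K, BLOCKS_PER_MODULE, NUM_CLASSES = 256, 4, 6, 1000  # illustrative sizes

def block():
    # A transformer-style block stands in for the real ViT block.
    return nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)

# 1) The deep target model is split into K modules of several blocks each.
modules = [nn.Sequential(*[block() for _ in range(BLOCKS_PER_MODULE)]) for _ in range(K)]

# 2) A global, extremely shallow meta model: one lightweight block per position,
#    shared across all module trainings (assumed pretrained and frozen in this sketch).
meta = nn.ModuleList([block() for _ in range(K)]).requires_grad_(False)
head = nn.Linear(D, NUM_CLASSES).requires_grad_(False)

def incubated_forward(x, i):
    # "Module incubation": module i takes the place of the i-th meta block, while the
    # remaining positions are filled by the shared meta blocks. Because every module
    # sees the same surrounding meta context, the trained modules stay compatible.
    for j in range(K):
        x = modules[i](x) if j == i else meta[j](x)
    return head(x.mean(dim=1))

def train_module(i, loader, steps=1000):
    # Each module can be trained independently (even on a separate machine).
    opt = torch.optim.AdamW(modules[i].parameters(), lr=1e-4)
    for _, (x, y) in zip(range(steps), loader):
        loss = nn.functional.cross_entropy(incubated_forward(x, i), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# 3) Assembling: concatenate the trained modules to obtain the target model.
#    (In practice, a brief end-to-end fine-tuning of the assembled model may follow.)
assembled = nn.Sequential(*modules)

def assembled_forward(x):
    return head(assembled(x).mean(dim=1))
```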
Related papers
- Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
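As a rough illustration of the weight-ensembling idea (not the paper's exact MoE-based fusion mechanism), the following hypothetical sketch combines the parameters of specialized single-task models according to a preference vector that encodes the desired trade-off between objectives.

```python
# Minimal sketch (assumption): fuse specialized single-task models by taking a
# preference-weighted combination of their parameters. This illustrates only the
# general weight-ensembling idea, not the paper's MoE-based fusion.
import torch

def fuse_weights(state_dicts, preference):
    # preference: one non-negative weight per expert, summing to 1, encoding the
    # desired trade-off between objectives (e.g., [0.3, 0.7] for two tasks).
    fused = {}
    for name in state_dicts[0]:
        fused[name] = sum(w * sd[name] for w, sd in zip(preference, state_dicts))
    return fused

# Usage (hypothetical): expert_a and expert_b are models fine-tuned on different objectives.
# model.load_state_dict(fuse_weights([expert_a.state_dict(), expert_b.state_dict()],
#                                    preference=[0.3, 0.7]))
```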
arXiv Detail & Related papers (2024-06-14T07:16:18Z) - Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models [31.960749305728488]
We introduce a novel concept dubbed the modular neural tangent kernel (mNTK).
We show that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$.
We propose a novel training strategy termed Modular Adaptive Training (MAT), which updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold.
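A hypothetical sketch of this selective-update control flow follows; `estimate_lambda_max` is a placeholder for the paper's mNTK eigenvalue computation, and the mean-based threshold is just one simple choice of dynamic threshold.

```python
# Sketch (assumption) of Modular Adaptive Training's control flow: update only the
# modules whose estimated principal mNTK eigenvalue exceeds a dynamic threshold.
# estimate_lambda_max is a hypothetical placeholder, not the paper's actual computation.
def mat_step(modules, optimizers, loss, estimate_lambda_max):
    lambdas = [estimate_lambda_max(m) for m in modules]
    threshold = sum(lambdas) / len(lambdas)   # one simple choice of dynamic threshold
    loss.backward()
    for module, opt, lam in zip(modules, optimizers, lambdas):
        if lam >= threshold:   # only modules learning "fast enough" are updated
            opt.step()
        opt.zero_grad()        # gradients of skipped modules are discarded
    return lambdas
```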
arXiv Detail & Related papers (2024-05-13T07:46:48Z) - m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers [27.73393245438193]
We propose module-to-module knowledge distillation (m2mKD) for transferring knowledge between modules.
We evaluate m2mKD on two modular neural architectures: Neural Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE).
Applying m2mKD to NACs yields significant improvements in IID accuracy on Tiny-ImageNet and OOD robustness on Tiny-ImageNet-R.
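The following is a minimal, hypothetical PyTorch sketch of a module-to-module distillation step, simply matching a student module's output to a teacher module's output on the same hidden inputs; the actual m2mKD setup (how NAC / V-MoE modules are paired and stitched) is more involved.

```python
# Minimal sketch (assumption): train a student module to match a teacher module's
# output on the same hidden inputs. Not the paper's full m2mKD procedure.
import torch
import torch.nn.functional as F

def m2m_distill_step(student_module, teacher_module, hidden, optimizer):
    with torch.no_grad():
        target = teacher_module(hidden)            # teacher output, no gradients
    loss = F.mse_loss(student_module(hidden), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```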
arXiv Detail & Related papers (2024-02-26T04:47:32Z) - Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods such as distillation, compression, and quantization leverage highly performant large models to induce smaller performant ones.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z) - Domain Generalization via Balancing Training Difficulty and Model
Capability [61.053202176230904]
Domain generalization (DG) aims to learn domain-generalizable models from one or multiple source domains that can perform well in unseen target domains.
Despite its recent progress, most existing work suffers from the misalignment between the difficulty level of training samples and the capability of contemporarily trained models.
We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulties.
arXiv Detail & Related papers (2023-09-02T07:09:23Z) - Modularizing while Training: A New Paradigm for Modularizing DNN Models [20.892788625187702]
We propose a novel approach that incorporates modularization into the model training process, i.e., modularizing-while-training (MwT).
The accuracy loss caused by MwT is only 1.13 percentage points, which is 1.76 percentage points less than that of the baseline.
The total time cost required for training and modularizing is only 108 minutes, half of the baseline.
arXiv Detail & Related papers (2023-06-15T07:45:43Z) - ModuleFormer: Modularity Emerges from Mixture-of-Experts [60.6148988099284]
This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models.
Unlike previous SMoE-based modular language models, ModuleFormer can induce modularity from uncurated data.
arXiv Detail & Related papers (2023-06-07T17:59:57Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
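A hypothetical sketch of this parameter-efficient recipe is shown below: the language model and visual encoder are frozen, and only a single linear projection plus one prepended learnable token are trained. Module and argument names are illustrative assumptions, not eP-ALM's actual interface.

```python
# Sketch (assumption) of the recipe described above: freeze the backbone models and
# train only a linear projection of perceptual features plus one prepended soft token.
# Names are illustrative; this is not eP-ALM's actual code or interface.
import torch
import torch.nn as nn

class PerceptuallyAugmentedLM(nn.Module):
    def __init__(self, language_model, visual_encoder, vis_dim, lm_dim):
        super().__init__()
        self.lm = language_model.requires_grad_(False)       # frozen (>99% of params)
        self.vision = visual_encoder.requires_grad_(False)   # frozen
        self.proj = nn.Linear(vis_dim, lm_dim)                     # trainable projection
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # trainable prepended token

    def forward(self, images, text_embeds):
        # images -> perceptual features -> projected into the LM's embedding space.
        with torch.no_grad():
            feats = self.vision(images)            # (B, N, vis_dim), assumed shape
        vis = self.proj(feats)
        tok = self.soft_token.expand(text_embeds.size(0), -1, -1)
        # The frozen LM consumes the concatenated embedding sequence.
        return self.lm(torch.cat([tok, vis, text_embeds], dim=1))
```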
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Few-Shot Learning of Compact Models via Task-Specific Meta Distillation [16.683801607142257]
We consider a new problem of few-shot learning of compact models.
We propose task-specific meta distillation that simultaneously learns two models in meta-learning.
arXiv Detail & Related papers (2022-10-18T15:06:47Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average 1.52x speedup across six different models over state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
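As a rough illustration of the recomputation half of this strategy (the out-of-core offloading and KARMA's scheduling are not shown), here is a short PyTorch sketch using gradient checkpointing, which discards intermediate activations in the forward pass and recomputes them during backward.

```python
# Sketch (assumption): the recomputation half of the combined strategy, via gradient
# checkpointing. KARMA's out-of-core offloading and scheduling are not shown here.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for blk in self.blocks:
            # Activations inside blk are freed after the forward pass and recomputed
            # during backward, trading extra compute for a smaller memory footprint.
            x = checkpoint(blk, x, use_reentrant=False)
        return x
```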
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.