One-stop Training of Multiple Capacity Models
- URL: http://arxiv.org/abs/2305.14066v2
- Date: Wed, 24 May 2023 09:37:47 GMT
- Title: One-stop Training of Multiple Capacity Models
- Authors: Lan Jiang, Haoyang Huang, Dongdong Zhang, Rui Jiang, Furu Wei
- Abstract summary: We propose a novel one-stop training framework to jointly train high-capacity and low-capacity models.
Unlike knowledge distillation, where multiple capacity models are trained separately from scratch, our approach integrates supervision from the different capacity models simultaneously.
- Score: 74.87789190840527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training models with varying capacities can be advantageous for deploying
them in different scenarios. While high-capacity models offer better
performance, low-capacity models require fewer computing resources for training
and inference. In this work, we propose a novel one-stop training framework to
jointly train high-capacity and low-capacity models. This framework consists of
two composite model architectures and a joint training algorithm called
Two-Stage Joint-Training (TSJT). Unlike knowledge distillation, where multiple
capacity models are trained separately from scratch, our approach integrates
supervision from the different capacity models simultaneously, leading to faster
and more efficient convergence. Extensive experiments on the multilingual
machine translation benchmark WMT10 show that our method outperforms
low-capacity baseline models and achieves comparable or better performance for
high-capacity models. Notably, the analysis demonstrates that our method
significantly influences the initial training process, leading to more
efficient convergence and superior solutions.
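The abstract does not spell out the architecture, but the core idea can be shown in a minimal sketch: a composite model in which a low-capacity path shares its lower layers with a high-capacity path, and both are optimized under one joint loss so supervision for the two capacities arrives simultaneously. The model sizes, layer choices, and loss weighting below are illustrative assumptions, not the authors' exact composite architectures or TSJT schedule.

```python
# Minimal sketch (assumptions, not the paper's exact setup): a low-capacity
# path sharing lower layers with a high-capacity path, trained with one joint
# loss so both capacities receive supervision at once.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class CompositeModel(nn.Module):
    def __init__(self, d_in=32, d_hid=128, n_cls=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())   # used by both paths
        self.extra = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU())   # high-capacity only
        self.low_head = nn.Linear(d_hid, n_cls)
        self.high_head = nn.Linear(d_hid, n_cls)

    def forward(self, x):
        h = self.shared(x)
        low_logits = self.low_head(h)                 # low-capacity prediction
        high_logits = self.high_head(self.extra(h))   # high-capacity prediction
        return low_logits, high_logits

model = CompositeModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
for _ in range(10):
    low_logits, high_logits = model(x)
    # Joint supervision: both capacities contribute to a single loss, unlike
    # sequential teacher-then-student knowledge distillation.
    loss = F.cross_entropy(high_logits, y) + F.cross_entropy(low_logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```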
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network)
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
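A hedged sketch of the logit-level idea described above: a small value network produces a correction that is added to a frozen pretrained model's logits at inference. The module names (FrozenBase, ValueNetwork) and dimensions are illustrative, not the paper's implementation.

```python
# Illustrative only: combine a frozen base model's logits with a learned
# logit-space correction from a separate value network.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32

class FrozenBase(nn.Module):
    """Stand-in for any pretrained model mapping hidden states to logits."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(hidden, vocab_size)
    def forward(self, h):
        return self.head(h)

class ValueNetwork(nn.Module):
    """Models post-training changes in logit space, so it can be reused with
    other base models that share the same vocabulary."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, vocab_size))
    def forward(self, h):
        return self.net(h)

base, value_net = FrozenBase(), ValueNetwork()
for p in base.parameters():
    p.requires_grad_(False)            # the base model stays frozen

h = torch.randn(4, hidden)             # hidden states for a batch of tokens
logits = base(h) + value_net(h)        # base predictions plus learned correction
probs = torch.softmax(logits, dim=-1)
```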
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management [35.06717005729781]
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components.
Developing such multi-task (MT) multi-modal (MM) models poses significant model management challenges for existing training systems.
We build a prototype system and evaluate it on various large MT MM models.
Experiments demonstrate the superior performance and efficiency of our system, with a speedup of up to 71% over state-of-the-art training systems.
arXiv Detail & Related papers (2024-09-05T09:10:40Z)
- MSfusion: A Dynamic Model Splitting Approach for Resource-Constrained Machines to Collaboratively Train Larger Models [16.012249716875132]
We introduce MSfusion, an effective and efficient collaborative learning framework for training large models on resource-constrained machines.
In each training round, each participant is assigned a subset of model parameters to train over local data, and aggregates with the sub-models of other peers over the parameters they share.
Experiments on image and NLP tasks illustrate significant advantages of MSfusion in performance and efficiency for training large models.
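A rough sketch of the split-and-aggregate idea, under the simplifying assumptions of a flat parameter vector and a fixed overlapping index assignment; MSfusion's actual splitting schedule and aggregation protocol may differ.

```python
# Each participant trains only a subset of a shared parameter vector; indices
# covered by several participants are averaged at aggregation time.
import torch

torch.manual_seed(0)
num_params = 16
global_params = torch.zeros(num_params)

# Hypothetical overlapping assignment of parameter indices to three participants.
assignments = [torch.arange(0, 10), torch.arange(6, 16), torch.arange(3, 13)]

def local_update(params, idx):
    # Placeholder for local training: nudge only the assigned slice.
    updated = params.clone()
    updated[idx] += 0.1 * torch.randn(len(idx))
    return updated

for _ in range(5):                                   # training rounds
    local_models = [local_update(global_params, idx) for idx in assignments]
    total = torch.zeros(num_params)
    counts = torch.zeros(num_params)
    for params, idx in zip(local_models, assignments):
        total[idx] += params[idx]
        counts[idx] += 1
    # Average each parameter over the participants that actually trained it.
    global_params = torch.where(counts > 0, total / counts.clamp(min=1), global_params)
```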
arXiv Detail & Related papers (2024-07-04T04:06:24Z)
- A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
We show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
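For intuition, a toy sketch of a coalesce, de-coalesce, interpolate cycle on a single weight matrix; the merging pattern, the stand-in for cheap training, and the interpolation coefficient are illustrative assumptions rather than the paper's construction.

```python
# Shrink a weight matrix by merging neuron pairs, "train" the small matrix
# cheaply, expand it back, and interpolate with the original weights to warm
# start full-size training.
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)                       # a layer's weight in the large model

def coalesce(w):
    # Merge adjacent output neurons (rows) by averaging -> half the rows.
    return 0.5 * (w[0::2] + w[1::2])

def decoalesce(w_small):
    # Map each merged neuron back to its two original rows (duplication).
    return w_small.repeat_interleave(2, dim=0)

W_small = coalesce(W)
W_small = W_small - 0.01 * torch.randn_like(W_small)   # stand-in for cheap training
W_expanded = decoalesce(W_small)

beta = 0.5                                   # interpolation coefficient (assumed)
W_new = beta * W_expanded + (1 - beta) * W   # warm start for full-size training
```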
arXiv Detail & Related papers (2024-04-07T03:04:34Z)
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models [88.94539115180919]
Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed to improve inference speed.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models.
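A minimal sketch of co-training with two-way distillation, the general mechanism CTCD builds on; the temperature, loss weighting, and model definitions here are placeholders, not the paper's exact objective.

```python
# Two models trained together: each optimizes its task loss and also matches
# the other's softened predictions (mutual distillation).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
large = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
small = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))
opt = torch.optim.Adam(list(large.parameters()) + list(small.parameters()), lr=1e-3)

def cotrain_step(x, y, tau=2.0, lam=0.5):
    zl, zs = large(x), small(x)
    task = F.cross_entropy(zl, y) + F.cross_entropy(zs, y)
    kd_s = F.kl_div(F.log_softmax(zs / tau, dim=-1),
                    F.softmax(zl / tau, dim=-1).detach(), reduction="batchmean")
    kd_l = F.kl_div(F.log_softmax(zl / tau, dim=-1),
                    F.softmax(zs / tau, dim=-1).detach(), reduction="batchmean")
    loss = task + lam * (tau ** 2) * (kd_s + kd_l)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
cotrain_step(x, y)
```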
arXiv Detail & Related papers (2023-11-06T03:29:00Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
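One simple way to realize such a transfer, shown purely for illustration and not necessarily the paper's procedure, is to distill only on samples where the source model is correct and the target model is not, leaving the target's existing strengths untouched.

```python
# Selective ("complementary") distillation: only samples the teacher gets right
# and the student gets wrong contribute to the transfer loss.
import torch
import torch.nn.functional as F

def complementary_kd_loss(student_logits, teacher_logits, labels, tau=2.0):
    with torch.no_grad():
        teacher_ok = teacher_logits.argmax(-1) == labels
        student_ok = student_logits.argmax(-1) == labels
        mask = (teacher_ok & ~student_ok).float()        # transfer-worthy samples
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="none").sum(-1)              # per-sample KL
    return (mask * kd).sum() / mask.sum().clamp(min=1.0)

# Example with random logits for 8 samples and 5 classes.
s = torch.randn(8, 5, requires_grad=True)
t = torch.randn(8, 5)
y = torch.randint(0, 5, (8,))
loss = complementary_kd_loss(s, t, y)
loss.backward()
```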
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
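A sketch of the low-rank adaptation idea the name suggests, assuming frozen uni-modal encoders with trainable LoRA updates and a simple feature-fusion classifier; the paper's adapter placement and fusion scheme may differ.

```python
# Frozen pretrained uni-modal encoders (stand-ins below) receive small
# trainable low-rank updates; their features are fused for a joint classifier.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0 x + (alpha/r) * B(A x), with W0 frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)      # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# Hypothetical frozen uni-modal encoders (stand-ins for pretrained models).
image_enc = LoRALinear(nn.Linear(512, 256))
audio_enc = LoRALinear(nn.Linear(128, 256))
classifier = nn.Linear(512, 10)

img, aud = torch.randn(4, 512), torch.randn(4, 128)
fused = torch.cat([image_enc(img), audio_enc(aud)], dim=-1)
logits = classifier(fused)
```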
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A 10-billion-parameter model trained on 50 languages achieves state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
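For context, a compact top-k mixture-of-experts layer with dense expert evaluation for clarity; the systems contributions of the paper (expert parallelism, sample-efficiency methods, and expert pruning at trillion-parameter scale) are well beyond this sketch.

```python
# A small top-k gated MoE layer: each token is routed to its k highest-scoring
# experts, whose outputs are combined with renormalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.gate(x)                           # (tokens, num_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        gates = torch.zeros_like(scores).scatter_(-1, topi, F.softmax(topv, dim=-1))
        # Dense evaluation for clarity; real systems dispatch tokens sparsely.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)

moe = TopKMoE()
tokens = torch.randn(16, 64)
out = moe(tokens)
```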
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.