ModuleFormer: Modularity Emerges from Mixture-of-Experts
- URL: http://arxiv.org/abs/2306.04640v2
- Date: Mon, 11 Sep 2023 19:31:26 GMT
- Title: ModuleFormer: Modularity Emerges from Mixture-of-Experts
- Authors: Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen,
Chuang Gan
- Abstract summary: This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models.
Unlike the previous SMoE-based modular language model, ModuleFormer can induce modularity from uncurated data.
- Score: 60.6148988099284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable results. However,
existing models are expensive to train and deploy, and it is also difficult to
expand their knowledge beyond pre-training data without forgetting previous
knowledge. This paper proposes a new neural network architecture, ModuleFormer,
that leverages modularity to improve the efficiency and flexibility of large
language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE).
Unlike the previous SMoE-based modular language model, which requires
domain-labeled data to learn domain-specific experts, ModuleFormer can induce
modularity from uncurated data with its new load balancing and concentration
losses. ModuleFormer is a modular architecture that includes two different
types of modules: new stick-breaking attention heads and feedforward experts.
Different modules are sparsely activated conditioned on the input token during
training and inference. In our experiment, we found that the modular
architecture enables three important abilities for large pre-trained language
models: 1) Efficiency, since ModuleFormer only activates a subset of its
modules for each input token, thus it could achieve the same performance as
dense LLMs with more than two times throughput; 2) Extendability, ModuleFormer
is more immune to catastrophic forgetting than dense LLMs and can be easily
extended with new modules to learn new knowledge that is not included in the
training data; 3) Specialisation, finetuning ModuleFormer could specialize a
subset of modules to the finetuning task and the task-unrelated modules could
be easily pruned for a lightweight deployment.
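The sparse activation described in the abstract, where only a subset of modules runs for each input token, can be sketched as top-k expert routing. This is a minimal illustration and not the authors' implementation: the function names (`top_k_route`, `sparse_moe`), the linear gate, the toy scaling experts, and the choice `k=2` are all assumptions made for the example, and the paper's load balancing and concentration losses and its stick-breaking attention are not modelled here.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    The weights of the selected experts are renormalised to sum to 1.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def sparse_moe(token, experts, gate_weights, k=2):
    """Apply only the k selected experts to one token vector.

    gate_weights is a hypothetical linear gate: one row of weights per expert.
    """
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)  # experts not selected are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


def make_expert(scale):
    """Toy expert that scales its input; a stand-in for a feedforward expert."""
    return lambda v: [scale * x for x in v]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
rng = random.Random(0)
gate_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
token = [0.5, -1.0, 2.0]

output, active = sparse_moe(token, experts, gate_weights, k=2)
print("active experts:", active)
print("output:", output)
```

With four experts and `k=2`, each token pays for two expert evaluations instead of four, which is the source of the throughput gain the abstract refers to. Pruning, in this picture, amounts to removing experts that the gate rarely or never selects for the target task.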
Related papers
- Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models [31.960749305728488]
We introduce a novel concept dubbed modular neural tangent kernel (mNTK).
We show that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$.
We propose a novel training strategy termed Modular Adaptive Training (MAT) to update those modules with their $\lambda_{\max}$ exceeding a dynamic threshold.
arXiv Detail & Related papers (2024-05-13T07:46:48Z)
- Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z)
- SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models [71.78800549517298]
Continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world.
Existing methods devise a learning module to acquire task-specific knowledge with a parameter-efficient tuning (PET) block and a selection module to pick out the corresponding one for the testing input.
We propose a novel Shared Attention Framework (SAPT) to align the PET learning and selection via the Shared Attentive Learning & Selection module.
arXiv Detail & Related papers (2024-01-16T11:45:03Z)
- GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs [64.49176353858792]
We propose generative neuro-symbolic visual reasoning by growing and reusing modules.
The proposed model performs competitively on standard tasks like visual question answering and referring expression comprehension.
It is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
arXiv Detail & Related papers (2023-11-08T18:59:05Z)
- Unlocking Emergent Modularity in Large Language Models [27.12431620957652]
We show that standard Language Models (LMs) could be fine-tuned as their Mixture-of-Experts (MoE) counterparts without introducing any extra parameters.
Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning.
arXiv Detail & Related papers (2023-10-17T01:02:32Z)
- CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules [51.82044734879657]
We propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions.
We find that CodeChain can significantly boost both the modularity and the correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests.
arXiv Detail & Related papers (2023-10-13T10:17:48Z)
- Composing Parameter-Efficient Modules with Arithmetic Operations [20.119291936493788]
We propose to compose parameter-efficient modules through linear arithmetic operations in the weight space.
Our approach requires no additional training and enables highly flexible module composition.
We extend our approach to detoxify Alpaca-LoRA, the latest instruction-tuned large language model based on LLaMA.
arXiv Detail & Related papers (2023-06-26T17:33:21Z)
- Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z)
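The idea of composing parameter-efficient modules through linear arithmetic in weight space, mentioned in the "Composing Parameter-Efficient Modules with Arithmetic Operations" entry above, can be illustrated with a small sketch. It assumes each module is stored as a dictionary of parameter deltas relative to a shared base model; the function names (`scale`, `add`, `negate`), the interpolation weight `lam`, and the toy values are hypothetical, and the operators that paper defines for specific module types may differ from this element-wise simplification.

```python
def scale(module, factor):
    """Multiply every parameter of a module by a scalar."""
    return {name: [factor * w for w in weights] for name, weights in module.items()}


def add(module_a, module_b, lam=0.5):
    """Weighted sum lam * a + (1 - lam) * b, parameter by parameter."""
    if module_a.keys() != module_b.keys():
        raise ValueError("modules must share the same parameter names")
    return {
        name: [lam * a + (1.0 - lam) * b
               for a, b in zip(module_a[name], module_b[name])]
        for name in module_a
    }


def negate(module):
    """Flip the sign of a module, e.g. to subtract an unwanted behaviour."""
    return scale(module, -1.0)


# Toy modules: parameter deltas for two hypothetical tasks.
task_a = {"delta": [0.2, -0.4]}
task_b = {"delta": [0.6, 0.0]}

merged = add(task_a, task_b, lam=0.5)   # blend two skills
removed = negate(task_a)                # invert one skill
print(merged, removed)
```

No gradient step is involved in either operation, which is what makes this kind of composition training-free.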