BASE Layers: Simplifying Training of Large, Sparse Models
- URL: http://arxiv.org/abs/2103.16716v1
- Date: Tue, 30 Mar 2021 23:08:32 GMT
- Title: BASE Layers: Simplifying Training of Large, Sparse Models
- Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer
- Abstract summary: We introduce a new balanced assignment of experts (BASE) layer for large language models.
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules.
We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
- Score: 53.98145464002843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new balanced assignment of experts (BASE) layer for large
language models that greatly simplifies existing high capacity sparse layers.
Sparse layers can dramatically improve the efficiency of training and inference
by routing each token to specialized expert modules that contain only a small
fraction of the model parameters. However, it can be difficult to learn
balanced routing functions that make full use of the available experts;
existing approaches typically use routing heuristics or auxiliary
expert-balancing loss functions. In contrast, we formulate token-to-expert
allocation as a linear assignment problem, allowing an optimal assignment in
which each expert receives an equal number of tokens. This optimal assignment
scheme improves efficiency by guaranteeing balanced compute loads, and also
simplifies training by not requiring any new hyperparameters or auxiliary
losses. Code is publicly released at https://github.com/pytorch/fairseq/
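To make the formulation concrete, here is a minimal sketch of balanced token-to-expert routing posed as a linear assignment problem. It uses SciPy's Hungarian solver as a stand-in and is not the released fairseq implementation, which uses a solver suited to large-scale distributed training.

```python
# Minimal sketch: balanced token-to-expert assignment as a linear assignment
# problem (the idea behind BASE layers). NOT the fairseq implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores: np.ndarray) -> np.ndarray:
    """scores: (num_tokens, num_experts) token-expert affinities.
    Returns one expert index per token such that every expert
    receives exactly num_tokens // num_experts tokens."""
    num_tokens, num_experts = scores.shape
    assert num_tokens % num_experts == 0, "tokens must split evenly across experts"
    capacity = num_tokens // num_experts

    # Duplicate each expert column `capacity` times so the problem becomes a
    # square one-to-one matching; maximizing affinity = minimizing -scores.
    cost = -np.repeat(scores, capacity, axis=1)         # (T, T)
    token_idx, slot_idx = linear_sum_assignment(cost)   # optimal matching
    expert_of_token = np.empty(num_tokens, dtype=np.int64)
    expert_of_token[token_idx] = slot_idx // capacity   # map slots back to experts
    return expert_of_token

# Usage: 8 tokens, 4 experts -> each expert gets exactly 2 tokens.
rng = np.random.default_rng(0)
assignment = balanced_assignment(rng.standard_normal((8, 4)))
print(np.bincount(assignment, minlength=4))  # [2 2 2 2]
```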
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training [44.48966200270378]
Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges.
We propose a bilevel optimization framework that complements ZO methods with PEFT to mitigate sensitivity to hard prompts.
Our Bilevel ZOFO method employs a double-loop optimization strategy, where only the gradient of the PEFT model and the forward pass of the base model are required.
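For context, below is a hedged sketch of the two-point zeroth-order (SPSA-style) gradient estimate that ZO fine-tuning methods build on. It is a generic forward-pass-only primitive, not the Bilevel ZOFO algorithm itself, and the function names are illustrative.

```python
# Hedged sketch: generic two-point zeroth-order gradient estimate
# (the forward-only primitive ZO methods rely on), not Bilevel ZOFO itself.
import torch

def zo_gradient_estimate(loss_fn, params: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Estimate the gradient of loss_fn at params from two forward
    evaluations along a random probe direction (no backward pass)."""
    u = torch.randn_like(params)                      # random probe direction
    loss_plus = loss_fn(params + eps * u)
    loss_minus = loss_fn(params - eps * u)
    return (loss_plus - loss_minus) / (2 * eps) * u   # directional finite difference

# Usage on a toy quadratic: the true gradient at x is 2x.
x = torch.tensor([1.0, -2.0, 3.0])
g = zo_gradient_estimate(lambda p: (p ** 2).sum(), x)
print(g)  # noisy estimate of [2., -4., 6.]; average several probes to reduce noise
```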
arXiv Detail & Related papers (2025-02-05T20:47:44Z)
- Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach [27.395660760819133]
Saved task vectors can easily be manipulated with arithmetic for different purposes, but this compositional flexibility comes at a high memory cost.
This work addresses these issues with a theoretically grounded framework that explains task vector arithmetic and introduces the task vector bases framework.
Our method significantly reduces the memory cost for downstream arithmetic with little effort, while achieving competitive performance and maintaining compositional advantage.
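For background, here is a minimal sketch of the task-vector arithmetic (tau = theta_finetuned - theta_base) that such editing methods operate on; the basis construction proposed in the paper is not reproduced, and the helper names are illustrative.

```python
# Minimal sketch: task-vector arithmetic on model state dicts.
# Not the paper's task-vector-bases method, only the underlying operation.
import torch

def task_vector(base: dict, finetuned: dict) -> dict:
    """Per-parameter difference between a fine-tuned and a base checkpoint."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base: dict, vectors: list, coeffs: list) -> dict:
    """Edit the base model by adding a weighted combination of task vectors."""
    edited = {k: v.clone() for k, v in base.items()}
    for tau, alpha in zip(vectors, coeffs):
        for k in edited:
            edited[k] += alpha * tau[k]
    return edited

# Usage with toy 'state dicts': combine two task vectors with equal weight.
base = {"w": torch.zeros(3)}
ft_a = {"w": torch.tensor([1.0, 0.0, 0.0])}
ft_b = {"w": torch.tensor([0.0, 1.0, 0.0])}
merged = apply_task_vectors(base, [task_vector(base, ft_a), task_vector(base, ft_b)], [0.5, 0.5])
print(merged["w"])  # tensor([0.5000, 0.5000, 0.0000])
```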
arXiv Detail & Related papers (2025-02-03T03:18:26Z)
- GOAL: A Generalist Combinatorial Optimization Agent Learning [0.05461938536945722]
GOAL is a model capable of efficiently solving multiple hard combinatorial optimization problems (COPs).
GOAL consists of a single backbone plus light-weight problem-specific adapters for input and output processing.
We show that GOAL is only slightly inferior to the specialized baselines while being the first multi-task model that solves a wide range of COPs.
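As a rough illustration of that architecture pattern (a shared backbone with light-weight per-problem adapters), here is a hedged PyTorch sketch; module names and sizes are illustrative and do not reproduce GOAL's actual design.

```python
# Hedged sketch: shared backbone + per-problem input/output adapters.
# Illustrative only; not GOAL's actual architecture.
import torch
import torch.nn as nn

class MultiProblemModel(nn.Module):
    def __init__(self, problems: dict, d_model: int = 128):
        super().__init__()
        # One backbone shared across all problems.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Small problem-specific adapters for input featurization and output heads.
        self.input_adapters = nn.ModuleDict(
            {name: nn.Linear(in_dim, d_model) for name, (in_dim, out_dim) in problems.items()}
        )
        self.output_adapters = nn.ModuleDict(
            {name: nn.Linear(d_model, out_dim) for name, (in_dim, out_dim) in problems.items()}
        )

    def forward(self, x: torch.Tensor, problem: str) -> torch.Tensor:
        h = self.input_adapters[problem](x)       # problem-specific input processing
        h = self.backbone(h)                      # shared computation
        return self.output_adapters[problem](h)   # problem-specific output head

# Usage: two toy problems with different feature/output sizes.
model = MultiProblemModel({"tsp": (2, 1), "knapsack": (3, 1)})
out = model(torch.randn(4, 10, 2), problem="tsp")
print(out.shape)  # torch.Size([4, 10, 1])
```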
arXiv Detail & Related papers (2024-06-21T11:55:20Z)
- Simplifying Neural Network Training Under Class Imbalance [77.39968702907817]
Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models.
The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures.
We demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods.
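As a rough illustration, the sketch below tunes only standard pipeline components (batch size, data augmentation, label smoothing) on a toy imbalanced dataset; the specific values are illustrative and are not the paper's tuned settings.

```python
# Hedged sketch: only stock pipeline knobs, no specialized imbalance method.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Label smoothing via the standard cross-entropy loss (no custom loss).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Toy imbalanced binary dataset (~10% positives); batch size treated as a
# tunable hyperparameter rather than a fixed default.
x = torch.randn(256, 16)
y = (torch.rand(256) < 0.1).long()
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for xb, yb in loader:
    # Data augmentation would normally live in the dataset transform;
    # simple noise injection stands in for it in this toy example.
    xb = xb + 0.01 * torch.randn_like(xb)
    loss = criterion(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```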
arXiv Detail & Related papers (2023-12-05T05:52:44Z)
- LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization [25.188067240126422]
Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks.
We present a general framework for training with label regularization, which includes conventional label smoothing (LS) but can also model instance-specific variants.
We propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem.
arXiv Detail & Related papers (2023-05-08T18:04:18Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of the baseline's, and the average solution gap to the expert solutions decreases from 6.8% to 1.3%.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Hash Layers For Large Sparse Models [48.90784451703753]
We modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence.
We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
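For comparison with learned routing, here is a hedged sketch of hash-based routing for a sparse feedforward layer, where a fixed hash of the token id picks the expert; sizes and the hash choice are illustrative, not the paper's exact setup.

```python
# Hedged sketch: hash-based (non-learned) routing of tokens to expert FFNs.
# Illustrative only; not the Hash Layers paper's exact configuration.
import torch
import torch.nn as nn

class HashRoutedFFN(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # Fixed (non-learned) token-id -> expert table, e.g. a random hash.
        self.register_buffer("hash_table", torch.randint(0, num_experts, (vocab_size,)))

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        expert_ids = self.hash_table[token_ids]
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(hidden[mask])  # tokens with the same id share an expert
        return out

# Usage: route a small batch through 4 experts.
layer = HashRoutedFFN(vocab_size=100, d_model=8, num_experts=4)
y = layer(torch.randn(2, 5, 8), torch.randint(0, 100, (2, 5)))
print(y.shape)  # torch.Size([2, 5, 8])
```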
arXiv Detail & Related papers (2021-06-08T14:54:24Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)