BASE Layers: Simplifying Training of Large, Sparse Models
- URL: http://arxiv.org/abs/2103.16716v1
- Date: Tue, 30 Mar 2021 23:08:32 GMT
- Title: BASE Layers: Simplifying Training of Large, Sparse Models
- Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke
Zettlemoyer
- Abstract summary: We introduce a new balanced assignment of experts (BASE) layer for large language models.
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules.
We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
- Score: 53.98145464002843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new balanced assignment of experts (BASE) layer for large
language models that greatly simplifies existing high capacity sparse layers.
Sparse layers can dramatically improve the efficiency of training and inference
by routing each token to specialized expert modules that contain only a small
fraction of the model parameters. However, it can be difficult to learn
balanced routing functions that make full use of the available experts;
existing approaches typically use routing heuristics or auxiliary
expert-balancing loss functions. In contrast, we formulate token-to-expert
allocation as a linear assignment problem, allowing an optimal assignment in
which each expert receives an equal number of tokens. This optimal assignment
scheme improves efficiency by guaranteeing balanced compute loads, and also
simplifies training by not requiring any new hyperparameters or auxiliary
losses. Code is publicly released at https://github.com/pytorch/fairseq/
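The linear-assignment formulation can be sketched with an off-the-shelf solver. The snippet below is a minimal illustration under stated assumptions, not the paper's streaming auction-algorithm implementation: `balanced_assign` is a hypothetical helper name, `scores` is an assumed token-expert affinity matrix, and the token count is assumed to divide evenly by the number of experts. Giving each expert an equal number of "slots" and solving the assignment optimally is the core idea.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assign(scores):
    """Optimally assign tokens to experts with equal per-expert load.

    scores: (num_tokens, num_experts) array of token-expert affinities.
    Returns an array of expert indices, one per token, where each expert
    receives exactly num_tokens // num_experts tokens.
    """
    num_tokens, num_experts = scores.shape
    assert num_tokens % num_experts == 0, "illustration assumes even divisibility"
    capacity = num_tokens // num_experts

    # Replicate each expert column `capacity` times so each expert offers
    # an equal number of slots; negate to turn maximization into min-cost.
    cost = -np.repeat(scores, capacity, axis=1)

    # Hungarian-style optimal assignment over the (tokens x slots) cost matrix.
    token_idx, slot_idx = linear_sum_assignment(cost)

    # Map each slot back to its expert.
    assignment = np.empty(num_tokens, dtype=int)
    assignment[token_idx] = slot_idx // capacity
    return assignment
```

Because every expert exposes exactly `capacity` slots, the result is balanced by construction, which is what removes the need for auxiliary load-balancing losses.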
Related papers
- GOAL: A Generalist Combinatorial Optimization Agent Learning [0.05461938536945722]
GOAL is a model capable of efficiently solving multiple hard combinatorial optimization problems (COPs).
GOAL consists of a single backbone plus lightweight problem-specific adapters for input and output processing.
We show that GOAL is only slightly inferior to the specialized baselines while being the first multi-task model that solves a wide range of COPs.
arXiv Detail & Related papers (2024-06-21T11:55:20Z) - Simplifying Neural Network Training Under Class Imbalance [77.39968702907817]
Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models.
The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures.
We demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods.
arXiv Detail & Related papers (2023-12-05T05:52:44Z) - Maestro: Uncovering Low-Rank Structures via Trainable Decomposition [15.254107731735553]
Deep Neural Networks (DNNs) have been a large driver for AI breakthroughs in recent years.
They have been getting increasingly large as they become more accurate and safe.
This means that their training becomes increasingly costly and time-consuming.
We propose Maestro, a framework for trainable low-rank layers.
arXiv Detail & Related papers (2023-08-28T23:08:15Z) - LABO: Towards Learning Optimal Label Regularization via Bi-level
Optimization [25.188067240126422]
Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks.
We present a general framework for training with label regularization, which includes conventional label smoothing (LS) but can also model instance-specific variants.
We propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem.
arXiv Detail & Related papers (2023-05-08T18:04:18Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - Learning to Optimize Permutation Flow Shop Scheduling via Graph-based
Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of the baseline's, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z) - Subspace Regularizers for Few-Shot Class Incremental Learning [26.372024890126408]
We present a new family of subspace regularization schemes that encourage weight vectors for new classes to lie close to the subspace spanned by the weights of existing classes.
Our results show that simple geometric regularization of class representations offers an effective tool for continual learning.
arXiv Detail & Related papers (2021-10-13T22:19:53Z) - Hash Layers For Large Sparse Models [48.90784451703753]
We modify the feedforward layer to hash to different sets of weights depending on the current token, for every token in the sequence.
We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
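As a rough illustration of the contrast with learned routing, parameter-free hash-based routing can be sketched as below. This is a sketch under assumptions, not the paper's exact scheme: `hash_route` is a hypothetical helper, and hashing the token id through MD5 is an arbitrary choice of bucketing function for illustration.

```python
import hashlib

def hash_route(token_id: int, num_experts: int) -> int:
    """Deterministic, parameter-free routing: hash the token id to an expert.

    The same token id always maps to the same expert, so routing involves
    no learned parameters and no auxiliary balancing loss; load balance
    depends on the token distribution rather than on training dynamics.
    """
    digest = hashlib.md5(str(token_id).encode()).digest()
    return int.from_bytes(digest[:4], "little") % num_experts
```

Usage: `hash_route(token_id, 16)` returns a fixed expert index in `[0, 16)` for that token id, every time it appears.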
arXiv Detail & Related papers (2021-06-08T14:54:24Z) - Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z) - Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results.
Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples.
These frameworks still face the challenge of reduced generalization ability on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.