Towards A Unified View of Sparse Feed-Forward Network in Pretraining
Large Language Model
- URL: http://arxiv.org/abs/2305.13999v3
- Date: Tue, 24 Oct 2023 03:41:37 GMT
- Title: Towards A Unified View of Sparse Feed-Forward Network in Pretraining
Large Language Model
- Authors: Zeyu Leo Liu, Tim Dettmers, Xi Victoria Lin, Veselin Stoyanov, Xian Li
- Abstract summary: Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method -- Avg-K -- which selects blocks through their mean aggregated hidden states and achieves lower perplexity in language model pretraining.
- Score: 58.9100867327305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE)
have proven effective in scaling up Transformer model size for
\textit{pretraining} large language models. By activating only part of the FFN
parameters conditioned on the input, S-FFN improves generalization performance
while keeping training and inference costs (in FLOPs) fixed. In this work, we
analyzed two major design choices of S-FFN: the memory block (a.k.a. expert)
size and the memory block selection method under a general conceptual framework
of sparse neural memory. Using this unified framework, we compare several S-FFN
architectures for language modeling and provide insights into their relative
efficacy and efficiency. We found that a simpler selection method --
\textbf{\texttt{Avg-K}} -- which selects blocks through their mean aggregated
hidden states, achieves lower perplexity in language model pretraining than
existing MoE architectures, including Switch Transformer (Fedus et al., 2021)
and HashLayer (Roller et al., 2021).
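To make the selection rule concrete, below is a minimal PyTorch-style sketch of the Avg-K idea as stated in the abstract: each memory block (here taken as a contiguous slice of the FFN hidden dimension) is scored by the mean of its pre-activation hidden states, and only the top-k blocks are evaluated. The function name, shapes, and single-token formulation are illustrative assumptions, not the authors' reference implementation.

```python
# Hedged sketch of Avg-K block selection for a sparse FFN (S-FFN) layer.
# Assumptions (not from the paper's code): a single-token input, ReLU FFN,
# and memory blocks taken as contiguous slices of the hidden dimension.
import torch
import torch.nn.functional as F


def avg_k_ffn(x, w1, w2, num_blocks, top_k):
    """x: (d_model,); w1: (d_model, d_ff); w2: (d_ff, d_model)."""
    d_model, d_ff = w1.shape
    block = d_ff // num_blocks  # hidden units per memory block (assumes an even split)

    # Score each block by its mean pre-activation; since mean_i(x . k_i) = x . mean_i(k_i),
    # this costs only num_blocks dot products against the averaged key vectors.
    mean_keys = w1.view(d_model, num_blocks, block).mean(dim=-1)   # (d_model, num_blocks)
    scores = x @ mean_keys                                         # (num_blocks,)
    chosen = scores.topk(top_k).indices                            # active block indices

    # Evaluate the FFN only on the selected blocks; the rest stay inactive.
    out = x.new_zeros(d_model)
    for b in chosen.tolist():
        sl = slice(b * block, (b + 1) * block)
        h = F.relu(x @ w1[:, sl])          # hidden states of this block
        out = out + h @ w2[sl, :]
    return out


# Example usage with made-up sizes: 32 blocks, 4 active per token.
x = torch.randn(512)
w1, w2 = torch.randn(512, 4096), torch.randn(4096, 512)
y = avg_k_ffn(x, w1, w2, num_blocks=32, top_k=4)
```

Unlike Switch Transformer's learned router or HashLayer's fixed hash assignment, this scoring adds no extra routing parameters, which is consistent with the abstract's framing of Avg-K as a simpler selection method.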
Related papers
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models [103.59785165735727]
We introduce RecurrentGemma, a family of open language models using Google's novel Griffin architecture.
Griffin combines linear recurrences with local attention to achieve excellent performance on language tasks.
We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both.
arXiv Detail & Related papers (2024-04-11T15:27:22Z) - Improving generalization in large language models by learning prefix
subspaces [5.911540700785975]
This article focuses on fine-tuning large language models (LLMs) in the scarce data regime (also known as the "few-shot" learning setting).
We propose a method to increase the generalization capabilities of LLMs based on neural network subspaces.
arXiv Detail & Related papers (2023-10-24T12:44:09Z) - Approximating Two-Layer Feedforward Networks for Efficient Transformers [15.793406740545024]
We present a general framework that unifies various methods to approximate two-layer NNs, including product-key memories (PKMs).
We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales.
This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs.
arXiv Detail & Related papers (2023-10-16T21:23:16Z) - MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - FF2: A Feature Fusion Two-Stream Framework for Punctuation Restoration [27.14686854704104]
We propose a Feature Fusion two-stream framework (FF2) for punctuation restoration.
Specifically, one stream leverages a pre-trained language model to capture the semantic feature, while another auxiliary module captures the feature at hand.
Without additional data, the experimental results on the popular benchmark IWSLT demonstrate that FF2 achieves new SOTA performance.
arXiv Detail & Related papers (2022-11-09T06:18:17Z) - A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental
Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model within a limited memory budget.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Applying Occam's Razor to Transformer-Based Dependency Parsing: What
Works, What Doesn't, and What is Really Necessary [9.347252855045125]
We study the choice of pre-trained embeddings and whether additional LSTM layers are needed in graph-based dependency parsers.
We propose a simple but widely applicable architecture and configuration, achieving new state-of-the-art results (in terms of LAS) for 10 out of 12 diverse languages.
arXiv Detail & Related papers (2020-10-23T22:58:26Z)