Retentive Network: A Successor to Transformer for Large Language Models
- URL: http://arxiv.org/abs/2307.08621v4
- Date: Wed, 9 Aug 2023 08:53:08 GMT
- Title: Retentive Network: A Successor to Transformer for Large Language Models
- Authors: Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue,
Jianyong Wang, Furu Wei
- Abstract summary: We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
- Score: 91.6652200825638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose Retentive Network (RetNet) as a foundation
architecture for large language models, simultaneously achieving training
parallelism, low-cost inference, and good performance. We theoretically derive
the connection between recurrence and attention. Then we propose the retention
mechanism for sequence modeling, which supports three computation paradigms,
i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel
representation allows for training parallelism. The recurrent representation
enables low-cost $O(1)$ inference, which improves decoding throughput and reduces
latency and GPU memory usage without sacrificing performance. The chunkwise recurrent
representation facilitates efficient long-sequence modeling with linear
complexity, where each chunk is encoded in parallel while the chunks are
summarized recurrently. Experimental results on language modeling show that
RetNet achieves favorable scaling results, parallel training, low-cost
deployment, and efficient inference. The intriguing properties make RetNet a
strong successor to Transformer for large language models. Code will be
available at https://aka.ms/retnet.
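To make the three computation paradigms concrete, below is a minimal single-head sketch in NumPy that follows the decay-weighted retention formulation described in the abstract. It omits the paper's multi-scale heads, xPos-style rotation, group normalization, and output gating; the sequence length, head dimension, decay value `gamma`, and chunk size are illustrative assumptions.

```python
# Minimal single-head retention sketch: parallel, recurrent, and chunkwise
# recurrent paradigms. Sizes and gamma are illustrative; T is a multiple of B.
import numpy as np

rng = np.random.default_rng(0)
T, d, gamma, B = 8, 4, 0.9, 4          # sequence length, head dim, decay, chunk size
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# --- Parallel form (training): Out = (Q K^T * D) V,
#     where D[n, m] = gamma^(n-m) for n >= m and 0 otherwise.
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = (Q @ K.T * D) @ V

# --- Recurrent form (O(1)-per-token decoding):
#     S_n = gamma * S_{n-1} + K_n^T V_n ;  o_n = Q_n S_n.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

# --- Chunkwise recurrent form (long sequences): each chunk is computed in
#     parallel, while a d x d state carries information across chunks.
out_chunk = np.zeros((T, d))
R = np.zeros((d, d))                   # cross-chunk state
j = np.arange(B)
D_B = np.where(j[:, None] >= j[None, :], gamma ** (j[:, None] - j[None, :]), 0.0)
for s in range(0, T, B):
    Qc, Kc, Vc = Q[s:s+B], K[s:s+B], V[s:s+B]
    inner = (Qc @ Kc.T * D_B) @ Vc                     # within-chunk, parallel
    cross = (Qc * (gamma ** (j + 1))[:, None]) @ R     # contribution of past chunks
    out_chunk[s:s+B] = inner + cross
    R = Kc.T @ (Vc * (gamma ** (B - 1 - j))[:, None]) + (gamma ** B) * R

assert np.allclose(out_parallel, out_recurrent) and np.allclose(out_parallel, out_chunk)
```

All three paths produce the same output: the parallel form supports training parallelism, the recurrent form updates a fixed-size $d \times d$ state per token, and the chunkwise form encodes each chunk in parallel while carrying that state across chunks.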
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference [8.527031391688283]
Kraken is an evolution of the standard Transformer architecture for efficient inference on multi-device systems.
When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers.
When tested on the SuperGLUE benchmark, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes.
arXiv Detail & Related papers (2024-08-14T20:24:03Z)
- Efficient Parallel Reinforcement Learning Framework using the Reactor Model [2.190190313041532]
Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources.
Existing frameworks, such as Ray, do not manage this orchestration efficiently.
We propose a solution implementing the reactor model, which enforces a fixed communication pattern among a set of actors.
arXiv Detail & Related papers (2023-12-07T21:19:57Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework that parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation (see the sketch after this list).
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)
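As referenced in the "Parallel Training of Deep Networks with Local Updates" entry above, the sketch below illustrates the general truncated layer-wise backpropagation idea in PyTorch: each block is trained against its own auxiliary loss on a detached input, so gradient computation stays local to the block and no global backward pass is needed. The layer sizes, auxiliary classification heads, optimizer settings, and data are illustrative assumptions and do not reproduce the paper's actual configuration.

```python
# Hedged sketch of local (truncated layer-wise) training: each block learns
# from its own auxiliary loss; detach() cuts the gradient at block boundaries.
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)])
aux_heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])   # local classifiers (assumed)
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=1e-2)
        for b, h in zip(blocks, aux_heads)]

x = torch.randn(64, 32)                  # toy batch of features
y = torch.randint(0, 10, (64,))          # toy labels

h = x
for block, head, opt in zip(blocks, aux_heads, opts):
    h = block(h.detach())                # truncate the gradient at the block boundary
    loss = nn.functional.cross_entropy(head(h), y)
    opt.zero_grad()
    loss.backward()                      # backpropagation stays local to this block
    opt.step()
```

Because each block's update depends only on its detached input and local head, the per-block updates can in principle be dispatched to different devices, which is the high-compute regime where the paper reports local parallelism being most effective.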