Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
- URL: http://arxiv.org/abs/2408.07802v2
- Date: Sat, 17 Aug 2024 00:58:10 GMT
- Title: Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
- Authors: Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff
- Abstract summary: Kraken is an evolution of the standard Transformer architecture for efficient inference on multi-device systems.
When trained on OpenWebText, Kraken models reach a similar perplexity to standard Transformers while preserving their language modeling capabilities on the SuperGLUE benchmark.
When tested on multi-GPU systems using TensorRT-LLM engines, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes, context lengths, and degrees of tensor parallelism.
- Score: 8.527031391688283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformer networks are increasingly used in settings where low inference latency can improve the end-user experience and enable new applications. However, autoregressive inference is resource intensive and requires parallelism for efficiency. Parallelism introduces collective communication that is both expensive and represents a phase when hardware resources are underutilized. Towards mitigating this, Kraken is an evolution of the standard Transformer architecture that is designed to complement existing tensor parallelism schemes for efficient inference on multi-device systems. By introducing a fixed degree of intra-layer model parallelism, the architecture allows collective operations to be overlapped with compute, decreasing latency and increasing hardware utilization. When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers while also preserving their language modeling capabilities when evaluated on the SuperGLUE benchmark. Importantly, when tested on multi-GPU systems using TensorRT-LLM engines, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes, context lengths, and degrees of tensor parallelism.
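The key mechanism described in the abstract, overlapping a tensor-parallel collective with compute that does not depend on its result, can be illustrated with a short sketch. This is not the Kraken implementation: it is a minimal illustration assuming PyTorch with torch.distributed already initialized (one process per device), and the device-local sub-layers `attn_local` and `mlp_local` are hypothetical stand-ins for whatever parallel branches the architecture provides.

```python
# Minimal sketch (not the authors' code) of overlapping a collective with
# independent compute. Assumes torch.distributed is initialized with an
# NCCL backend and one process per GPU; attn_local / mlp_local are
# hypothetical device-local sub-layers.
import torch
import torch.distributed as dist

def overlapped_block(x: torch.Tensor, attn_local, mlp_local) -> torch.Tensor:
    # Device-local attention produces a partial result that must be summed
    # across tensor-parallel ranks.
    partial = attn_local(x)
    work = dist.all_reduce(partial, op=dist.ReduceOp.SUM, async_op=True)

    # While the all-reduce is in flight, run compute that does not depend
    # on the reduced result (a parallel sub-layer fills this slot).
    independent = mlp_local(x)

    work.wait()   # collective finished; `partial` now holds the summed result
    return partial + independent
```

The only point of the sketch is the scheduling: because the all-reduce is launched with `async_op=True`, the independent branch runs while the collective is in flight, which is the kind of compute/communication overlap the abstract describes.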
Related papers
- Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping [36.71999572939612]
We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models.
Applying Ladder Residual to all of a model's layers can achieve a 29% end-to-end wall-clock speedup at inference time with tensor-parallel sharding over 8 devices.
We train 1B- and 3B-parameter Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline.
arXiv Detail & Related papers (2025-01-11T17:06:30Z)
- DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing between accelerators through specialized high-speed interconnects.
We introduce Decoupled Momentum (DeMo), a fused optimizer and data-parallel algorithm that reduces inter-accelerator communication requirements.
Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention (a small numerical sketch of this parallel/recurrent equivalence appears after this list).
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference [3.005912820808423]
Inference on autoregressive models harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens.
We propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel.
By orchestrating token-level parallelism, Flover exhibits optimal hardware efficiency and significantly spares system resources.
arXiv Detail & Related papers (2023-05-22T20:58:09Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models up to 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL): cross-mesh resharding.
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
- RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework that parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation (a minimal sketch of this idea appears after this list).
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
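The Retentive Network entry above notes a derived connection between recurrence and attention. As a rough, simplified illustration (single head, real-valued decay, none of the paper's normalization or rotary terms; not the RetNet reference code), the sketch below checks numerically that a decayed attention-style "parallel form" and a step-by-step "recurrent form" compute the same outputs.

```python
# Simplified illustration (not the RetNet reference implementation) of the
# equivalence between retention's parallel ("attention-like") form and its
# recurrent ("RNN-like") form: single head, no normalization or rotations.
import torch

def retention_parallel(Q, K, V, gamma):
    # o_n = sum_{m <= n} gamma^(n-m) (Q_n . K_m) V_m, written as (Q K^T * D) V
    T = Q.shape[0]
    n = torch.arange(T).unsqueeze(1)
    m = torch.arange(T).unsqueeze(0)
    D = torch.where(n >= m, gamma ** (n - m).float(), torch.zeros(T, T))
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    # S_t = gamma * S_{t-1} + K_t^T V_t ;  o_t = Q_t S_t
    S = torch.zeros(Q.shape[1], V.shape[1])
    outputs = []
    for t in range(Q.shape[0]):
        S = gamma * S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)
        outputs.append(Q[t].unsqueeze(0) @ S)
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    torch.manual_seed(0)
    T, d, gamma = 8, 4, 0.9
    Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
    assert torch.allclose(retention_parallel(Q, K, V, gamma),
                          retention_recurrent(Q, K, V, gamma), atol=1e-4)
    print("parallel and recurrent retention agree")
```

The parallel form supports training over a whole sequence at once, while the recurrent form carries a fixed-size state across steps, which is what enables the low-cost inference the entry mentions.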
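The "Parallel Training of Deep Networks with Local Updates" entry describes replacing global backpropagation with truncated layer-wise backpropagation. The toy sketch below shows only the gradient-truncation part of that idea under simplifying assumptions: a two-block MLP in which each block trains against its own hypothetical auxiliary head and activations are detached between blocks. In the paper's setting the blocks would additionally run concurrently on separate devices; this sketch is sequential and illustrative only.

```python
# Toy sketch of truncated layer-wise ("local") updates: each block trains
# against its own auxiliary head, and activations are detached between
# blocks so no gradient crosses block boundaries. Illustrative only; the
# block/head shapes and SGD settings are arbitrary choices.
import torch
import torch.nn as nn

blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
])
heads = nn.ModuleList([nn.Linear(64, 10) for _ in blocks])  # local auxiliary heads
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)             # dummy inputs
y = torch.randint(0, 10, (16,))     # dummy labels

h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h)                    # forward through this block only
    loss = loss_fn(head(h), y)      # block-local objective
    opt.zero_grad()
    loss.backward()                 # gradients stay within this block and its head
    opt.step()
    h = h.detach()                  # truncate: nothing propagates to earlier blocks
```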
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.