Related papers: Efficiently Scaling Transformer Inference

Efficiently Scaling Transformer Inference

URL: http://arxiv.org/abs/2211.05102v1
Date: Wed, 9 Nov 2022 18:50:38 GMT
Title: Efficiently Scaling Transformer Inference
Authors: Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean
Abstract summary: We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices. We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
Score: 8.196193683641582
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share single key/value head) enables scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.

Related papers

FFN Fusion: Rethinking Sequential Computation in Large Language Models [16.8637819797503]
We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations. Applying these techniques to Llama-3.1-405B-Instruct, we create an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost.
arXiv Detail & Related papers (2025-03-24T17:20:35Z)
Hyper Compressed Fine-Tuning of Large Foundation Models with Quantum Inspired Adapters [0.0]
emphQuantum-Inspired Adapters, a PEFT approach inspired by Hamming-weight quantum circuits from quantum machine learning literature. We test our proposed adapters by adapting large language models and large vision transformers on benchmark datasets.
arXiv Detail & Related papers (2025-02-10T13:06:56Z)
MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference [0.8388591755871735]
Homomorphic Encryption (HE) enables us to perform machine learning tasks over encrypted data. We propose MOFHEI, a framework that optimize the model to make HE-based neural network inference, fast and efficient. Our framework achieves up to 98% pruning ratio on LeNet, eliminating up to 93% of the required HE operations for performing PI.
arXiv Detail & Related papers (2024-12-10T22:44:54Z)
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs [17.72841008597783]
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model derived from Llama-3.1-70B-Instruct.
arXiv Detail & Related papers (2024-11-28T13:45:42Z)
FluidML: Fast and Memory Efficient Inference Optimization [3.7676096626244986]
We present FluidML, a generic runtime memory management and optimization framework. We show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models. We also show that FluidML can reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-14T07:16:23Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE) ParFormer improves feature extraction by combining convolutional and attention mechanisms. For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38 $times$ higher than EdgeNeXt-S. The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant)
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification [14.559316921646356]
Fine-tuning Transformer-based models for downstream tasks has long running time and high memory consumption. We propose the SPT system to fine-tune Transformer-based models efficiently by introducing sparsity. SPT consistently outperforms well-optimized baselines, reducing the peak memory consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
arXiv Detail & Related papers (2023-12-16T07:44:52Z)
MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints. We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B. We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering [75.86788916930377]
bilaterally slimmable Transformer (BST) integrated into arbitrary Transformer-based VQA models. One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2. Smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z)
MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost. We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
Recurrent multiple shared layers in Depth for Neural Machine Translation [11.660776324473645]
We propose to train a deeper model with recurrent mechanism, which loops the encoder and decoder blocks of Transformer in the depth direction. Compared to the deep Transformer(20-layer encoder, 6-layer decoder), our model has similar model performance and infer speed, but our model parameters are 54.72% of the former.
arXiv Detail & Related papers (2021-08-23T21:21:45Z)
FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs. We replace the self-attention sublayers with simple linear transformations that "mix" input tokens. The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.