Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
- URL: http://arxiv.org/abs/2501.11779v2
- Date: Tue, 11 Feb 2025 17:36:32 GMT
- Title: Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
- Authors: Pouya Hamadanian, Sadjad Fouladi
- Abstract summary: Glinthawk is an architecture for offline Large Language Model (LLM) inference. It improves throughput by $5.9\times$ and reduces the cost of generation by $2.8\times$. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation.
- Score: 2.6498598849144472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of the attention mechanism, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by $5.9\times$ and reduces the cost of generation by $2.8\times$, compared to paged-attention baselines. For long sequence lengths, it achieves a $16.3\times$ throughput improvement at $2.4\times$ less cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-focused applications such as batch processing. The prototype is publicly available at https://github.com/microsoft/glinthawk.
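To make the split concrete, here is a minimal single-layer, single-head decode-step sketch in NumPy. It assumes a hypothetical division of labor in which Tier 1 runs the weight-bound projections and FFN while Tier 2 owns the per-sequence KV cache and runs the memory-bound attention; all class names, shapes, and the single-head simplification are illustrative, not Glinthawk's actual API.

```python
import numpy as np

D = 64  # head dimension (single head, single layer, for brevity)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Tier1:
    """High-end accelerator: holds the weights, runs the dense matmuls."""
    def __init__(self, rng):
        self.wq, self.wk, self.wv, self.w_ffn = (
            rng.standard_normal((D, D)) * 0.02 for _ in range(4))

    def pre_attention(self, x):          # x: (batch, D), one new token per sequence
        return x @ self.wq, x @ self.wk, x @ self.wv

    def post_attention(self, attn_out):  # FFN after the attention results return
        return np.maximum(attn_out @ self.w_ffn, 0.0)

class Tier2:
    """Lower-end node: owns the KV cache, so cache memory scales with cheap
    Tier-2 capacity instead of competing with weights on the accelerator."""
    def __init__(self, batch):
        self.kc = [np.empty((0, D)) for _ in range(batch)]
        self.vc = [np.empty((0, D)) for _ in range(batch)]

    def attention(self, q, k_new, v_new):
        out = np.empty_like(q)
        for i in range(q.shape[0]):
            self.kc[i] = np.vstack([self.kc[i], k_new[i]])  # append this step's K/V
            self.vc[i] = np.vstack([self.vc[i], v_new[i]])
            w = softmax(self.kc[i] @ q[i] / np.sqrt(D))     # attend over full history
            out[i] = w @ self.vc[i]
        return out

# One decode step: Tier 1 computes q/k/v, ships them to Tier 2 over the
# network, gets the attention output back, and finishes the layer.
rng = np.random.default_rng(0)
tier1, tier2 = Tier1(rng), Tier2(batch=4)
x = rng.standard_normal((4, D))
q, k, v = tier1.pre_attention(x)
y = tier1.post_attention(tier2.attention(q, k, v))
print(y.shape)  # (4, 64)
```

Because the caches live on Tier 2, their footprint no longer competes with the model weights for accelerator memory, which is what allows the larger batch sizes the abstract describes.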
Related papers
- SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs [5.760049762453579]
Accelerating large language models on CPUs enables broader AI access at lower cost and power consumption.
We provide a set of open-source customized sparse kernels that can speed up any PyTorch model.
We demonstrate for the first time the use of unstructured sparsity in the attention, achieving a $1.14\times$ speedup over current systems.
arXiv Detail & Related papers (2025-02-18T02:26:34Z) - Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference [20.68731158617374]
Dovetail deploys the draft model on the GPU to generate draft tokens while the target model performs parallel verification on the CPU. It achieves an inference speed of 5.86 tokens per second for LLaMA2-Chat-7B using 3 GB of VRAM, an approximately 2.77x improvement over CPU-only inference (a generic draft-and-verify sketch appears after this list).
arXiv Detail & Related papers (2024-12-25T15:45:18Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory hungry and incur significant inference latency even on cutting-edge AI accelerators. We propose LeanAttention, a scalable technique for computing self-attention in the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency (a token-pruning sketch in this spirit appears after this list).
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator (a sketch of the predict-then-recompute idea appears after this list).
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design [5.962184741057505]
This paper aims to address computational redundancy at all design levels in a memory-efficient manner.
We discover that using a larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance.
We introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff.
arXiv Detail & Related papers (2024-01-29T09:12:23Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and an over-80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT, with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM (a tiled online-softmax sketch of this idea appears after this list).
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z) - Fast Vision Transformers with HiLo Attention [40.8842135978138]
Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision.
We introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods.
Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation.
arXiv Detail & Related papers (2022-05-26T08:16:14Z) - LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models [34.673688610935876]
We show that the latency-perplexity Pareto frontier can be found without the need for any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be matched with up to 2x lower latency.
arXiv Detail & Related papers (2022-03-04T02:10:43Z) - Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z) - EL-Attention: Memory Efficient Lossless Attention for Generation [27.59275177303199]
We propose memory-efficient lossless attention (called EL-attention) to address this issue.
It avoids the heavy operations of building multi-head keys and values, and requires no cache.
We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks.
arXiv Detail & Related papers (2021-05-11T04:37:52Z)
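The Dovetail entry above describes the draft-and-verify pattern of speculative decoding split across heterogeneous hardware. Below is a hedged toy sketch of that loop: both "models" are random lookup tables standing in for real networks, verification is done greedily, and nothing here reflects Dovetail's actual code. In a real system the target model verifies all draft positions in a single batched forward pass rather than one at a time.

```python
import numpy as np

VOCAB = 50

def make_toy_model(seed):
    """Stand-in 'LM': next-token logits depend only on the last token."""
    table = np.random.default_rng(seed).standard_normal((VOCAB, VOCAB))
    return lambda prefix: table[prefix[-1]]

def speculative_decode(draft, target, prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Fast draft model (e.g. on the GPU) proposes k tokens greedily.
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = int(np.argmax(draft(ctx)))
            proposal.append(t)
            ctx.append(t)
        # 2) Target model (e.g. on the CPU) checks each position; on the
        #    first disagreement, its own token is kept and the rest of the
        #    draft is discarded. The output equals plain greedy decoding
        #    with the target model, just computed in bursts.
        for t in proposal:
            want = int(np.argmax(target(tokens)))
            tokens.append(want)
            if want != t:
                break
    return tokens[:len(prompt) + n_new]

out = speculative_decode(make_toy_model(0), make_toy_model(1), [1, 2, 3], n_new=12)
print(out)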
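The FastV entry above rests on the observation that deep-layer attention over visual tokens is largely wasted. Here is a minimal sketch of a pruning step in that spirit, with purely illustrative shapes and names; the real method's layer choice and scoring details are in the paper.

```python
import numpy as np

def prune_visual_tokens(hidden, attn, n_text, keep_ratio=0.5):
    """hidden: (n_tokens, d); attn: (n_tokens, n_tokens) attention weights.
    The first n_text rows are text tokens (never pruned); the rest are visual."""
    n_vis = hidden.shape[0] - n_text
    # Importance of each visual token = average attention it receives.
    importance = attn[:, n_text:].mean(axis=0)
    keep = np.sort(np.argsort(-importance)[: int(n_vis * keep_ratio)])
    idx = np.concatenate([np.arange(n_text), n_text + keep])
    return hidden[idx]  # all later layers now run on fewer tokens

rng = np.random.default_rng(0)
h = rng.standard_normal((16 + 576, 64))   # 16 text + 576 visual tokens
a = rng.random((h.shape[0], h.shape[0]))
print(prune_visual_tokens(h, a, n_text=16).shape)  # (16 + 288, 64)
```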
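The HiRE entry above describes a predict-then-recompute pattern: a cheap compressed scorer proposes a high-recall candidate set, and exact computation runs only on that subset. A hedged sketch using a random projection as the compressor follows; the paper's actual compression scheme is learned, so everything below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, r = 256, 32000, 32                      # hidden dim, vocab, compressed dim
W = rng.standard_normal((vocab, d)) / np.sqrt(d)  # softmax weight matrix
R = rng.standard_normal((d, r)) / np.sqrt(r)      # random projection (assumption)
W_c = W @ R                                       # compressed weights, precomputed once

def approx_topk(x, k=16, overshoot=8):
    # 1) Cheap scores in r dims; overshoot the candidate count for high recall.
    approx = W_c @ (R.T @ x)
    cand = np.argpartition(approx, -k * overshoot)[-k * overshoot:]
    # 2) Exact scores only for the candidate rows (a tiny slice of W).
    exact = W[cand] @ x
    return cand[np.argsort(-exact)[:k]]

x = rng.standard_normal(d)
pred = approx_topk(x)
true = np.argsort(-(W @ x))[:16]
print("recall:", len(set(pred) & set(true)) / 16)
```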
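The FlashAttention entry above is about avoiding round trips to slow memory: stream K and V in tiles and keep running softmax statistics so the full n-by-n score matrix is never materialized. A hedged NumPy sketch of the online-softmax accumulation follows; the real algorithm is a fused GPU kernel, and this only shows the math.

```python
import numpy as np

def tiled_attention(q, k, v, tile=64):
    n, d = q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    denom = np.zeros(n)             # running softmax denominator per row
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = q @ kt.T / np.sqrt(d)                    # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        scale = np.exp(row_max - new_max)            # rescale earlier accumulators
        p = np.exp(s - new_max[:, None])
        denom = denom * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vt
        row_max = new_max
    return out / denom[:, None]

# Matches the naive reference up to floating-point error.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
ref = np.exp(q @ k.T / np.sqrt(32))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref)
```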
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.