A Hardware Evaluation Framework for Large Language Model Inference
- URL: http://arxiv.org/abs/2312.03134v1
- Date: Tue, 5 Dec 2023 21:01:33 GMT
- Title: A Hardware Evaluation Framework for Large Language Model Inference
- Authors: Hengrui Zhang, August Ning, Rohan Prabhakar, David Wentzlaff
- Abstract summary: This work introduces LLMCompass, a hardware evaluation framework for Large Language Models.
LLMCompass is fast, accurate, versatile, and able to describe and evaluate different hardware designs.
With the aid of LLMCompass, this work draws architectural implications and explores new cost-effective hardware designs.
- Score: 9.073225245382854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The past year has witnessed the increasing popularity of Large Language
Models (LLMs). Their unprecedented scale and associated high hardware cost have
impeded their broader adoption, calling for efficient hardware designs. With
the large-scale hardware needed simply to run LLM inference, evaluating different
hardware designs becomes a new bottleneck.
This work introduces LLMCompass, a hardware evaluation framework for LLM
inference workloads. LLMCompass is fast, accurate, versatile, and able to
describe and evaluate different hardware designs. LLMCompass includes a mapper
to automatically find performance-optimal mapping and scheduling. It also
incorporates an area-based cost model to help architects reason about their
design choices. Compared to real-world hardware, LLMCompass' estimated latency
achieves an average 10.4% error rate across various operators with various
input sizes and an average 4.1% error rate for LLM inference. With LLMCompass,
simulating a 4-NVIDIA A100 GPU node running GPT-3 175B inference can be done
within 16 minutes on commodity hardware, including 26,400 rounds of the
mapper's parameter search.
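
As a rough illustration of the mapper's job, the sketch below brute-forces tile sizes for a single matmul under a toy roofline cost model and keeps the fastest mapping. The function names, tile options, and hardware constants (A100-like peak FLOPS and bandwidth) are illustrative assumptions, not LLMCompass's actual search space or cost model.

```python
import itertools

def roofline_latency(m, n, k, tm, tn, tk,
                     flops_per_s=312e12, bytes_per_s=2.0e12):
    """Toy latency model: max of compute time and memory time."""
    flops = 2 * m * n * k
    tiles = (m // tm) * (n // tn) * (k // tk)
    # fp16 bytes moved per tile with naive (no cross-tile) reuse
    bytes_moved = tiles * (tm * tk + tk * tn + tm * tn) * 2
    return max(flops / flops_per_s, bytes_moved / bytes_per_s)

def search_mapping(m, n, k, tile_opts=(64, 128, 256)):
    """Exhaustively score every (tm, tn, tk) tiling that divides the problem."""
    best = None
    for tm, tn, tk in itertools.product(tile_opts, repeat=3):
        if m % tm or n % tn or k % tk:
            continue
        lat = roofline_latency(m, n, k, tm, tn, tk)
        if best is None or lat < best[0]:
            best = (lat, (tm, tn, tk))
    return best

print(search_mapping(4096, 4096, 4096))
```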
With the aid of LLMCompass, this work draws architectural implications and
explores new cost-effective hardware designs. By reducing the compute
capability or replacing High Bandwidth Memory (HBM) with traditional DRAM,
these new designs can achieve as much as 3.41x improvement in performance/cost
compared to an NVIDIA A100, making them promising choices for democratizing
LLMs.
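
The performance/cost arithmetic behind such a claim is simple to reproduce; the numbers below are made up for illustration and are not the paper's measurements:

```python
# perf/cost = throughput per dollar; a design that loses some throughput
# can still win if its cost falls faster (illustrative numbers only).
baseline = {"tokens_per_s": 100.0, "cost_usd": 10_000.0}   # A100-like stand-in
candidate = {"tokens_per_s": 85.0, "cost_usd": 2_500.0}    # hypothetical design

def perf_per_cost(d):
    return d["tokens_per_s"] / d["cost_usd"]

ratio = perf_per_cost(candidate) / perf_per_cost(baseline)
print(f"perf/cost improvement: {ratio:.2f}x")  # -> 3.40x with these numbers
```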
LLMCompass is planned to be fully open-source.
Related papers
- LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services [0.5143325455623888]
LLM-Pilot is a first-of-its-kind system for characterizing and predicting performance of LLM inference services.
It learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM.
Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
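A minimal sketch of the predict-then-recommend idea, assuming a learned latency model over simple model/hardware features; the features, data, GPU names, and prices below are all hypothetical, not LLM-Pilot's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Benchmarked pairs: [model size (B params), memory bandwidth (TB/s)] -> ms/token
bench = np.array([[7, 2.0], [7, 1.0], [13, 2.0], [13, 1.0], [70, 2.0]])
latency_ms = np.array([8.0, 16.0, 15.0, 30.0, 80.0])
predictor = LinearRegression().fit(bench, latency_ms)

gpus = {"gpu_a": (2.0, 30_000), "gpu_b": (1.0, 12_000)}  # bandwidth, price

def recommend(params_b, slo_ms):
    """Cheapest GPU whose predicted latency meets the SLO, else None."""
    ok = {name: price for name, (bw, price) in gpus.items()
          if predictor.predict([[params_b, bw]])[0] <= slo_ms}
    return min(ok, key=ok.get) if ok else None

print(recommend(13, slo_ms=20.0))  # -> "gpu_a" under this toy fit
```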
arXiv Detail & Related papers (2024-10-03T12:19:06Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
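To make the 4x ceiling concrete: decode-phase GEMMs are memory-bound, so shrinking weight traffic from 16 bits to 4 bits bounds the achievable speedup near 4x. Below is a plain-NumPy sketch of unpacking symmetric INT4 weights before a GEMM; MARLIN itself fuses this into a pipelined GPU kernel, and the packing scheme here is an assumption for illustration:

```python
import numpy as np

def unpack_int4(packed: np.ndarray, scale: float) -> np.ndarray:
    """Unpack two 4-bit weights per byte to fp16 and apply one scale."""
    lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble  -> [-8, 7]
    hi = (packed >> 4).astype(np.int8) - 8     # high nibble -> [-8, 7]
    w = np.empty(packed.size * 2, dtype=np.float16)
    w[0::2], w[1::2] = lo, hi
    return w * np.float16(scale)

packed = np.array([0x21, 0xFE], dtype=np.uint8)  # four 4-bit weights
print(unpack_int4(packed, scale=0.1))  # [-0.7 -0.6  0.6  0.7]
```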
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
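With weights constrained to {-1, 0, +1}, a "matmul" degenerates into additions and subtractions. A toy sketch of that core trick (not the paper's optimized kernels or its full architecture):

```python
import numpy as np

def ternary_linear(x, w_ternary):
    """y = x @ W.T where W's entries are in {-1, 0, +1}: no multiplies."""
    out = np.zeros((x.shape[0], w_ternary.shape[0]), dtype=x.dtype)
    for j, row in enumerate(w_ternary):
        plus = x[:, row == 1].sum(axis=1)    # add where the weight is +1
        minus = x[:, row == -1].sum(axis=1)  # subtract where it is -1
        out[:, j] = plus - minus
    return out

x = np.array([[1.0, 2.0, 3.0]])
W = np.array([[1, 0, -1], [0, 1, 1]])
print(ternary_linear(x, W))  # [[-2.  5.]]
```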
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
- Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models [8.02264001053969]
Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts.
With constant innovation in LLM serving optimizations and model architecture evolving at breakneck speed, the hardware requirements to meet Service Level Objectives (SLOs) remain an open research question.
We present an analytical tool, GenZ, to efficiently navigate the relationship between diverse LLM model architectures and AI platform design parameters.
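The kind of first-order analysis such a tool automates can be sketched with standard back-of-envelope formulas; the approximations below (FLOPs per decoded token ~ 2x parameter count, KV cache as two fp16 tensors per layer) are textbook estimates, not GenZ's actual model:

```python
def decode_requirements(n_layers, d_model, seq_len, batch, bytes_per=2):
    """Rough per-token decode FLOPs and KV-cache bytes for a dense Transformer."""
    params = 12 * n_layers * d_model**2          # attention + MLP weights
    flops_per_token = 2 * params                 # one multiply-add per weight
    kv_cache = 2 * n_layers * seq_len * d_model * batch * bytes_per
    return flops_per_token, kv_cache

# GPT-3 175B-like shape: 96 layers, d_model 12288, 2048 context, batch 8
flops, kv = decode_requirements(96, 12288, 2048, 8)
print(f"{flops / 1e9:.0f} GFLOP/token, {kv / 2**30:.1f} GiB KV cache")
```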
arXiv Detail & Related papers (2024-06-03T18:00:50Z)
- Optimizing LLM Queries in Relational Data Analytics Workloads [50.95919232839785]
Batch data analytics is a growing application for Large Language Models (LLMs).
LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets.
We propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads.
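One simple tactic in this spirit is deduplicating identical per-row prompts so each distinct value costs one LLM call; the paper's techniques go further (e.g., exploiting request ordering and batching), and `call_llm` below is a hypothetical placeholder, not a real API:

```python
from functools import lru_cache

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM endpoint."""
    raise NotImplementedError("wire up your LLM client here")

@lru_cache(maxsize=None)
def classify_cached(value: str) -> str:
    # each distinct column value reaches the LLM exactly once
    return call_llm(f"Classify the sentiment of: {value}")

def classify_column(rows):
    return [classify_cached(v) for v in rows]
```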
arXiv Detail & Related papers (2024-03-09T07:01:44Z)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
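The published recipe quantizes weights to {-1, 0, +1} via absmean scaling; a minimal sketch of that rounding step (training details omitted):

```python
import numpy as np

def absmean_ternary(w, eps=1e-8):
    """Scale by mean |w|, then round and clip to {-1, 0, +1}."""
    gamma = np.abs(w).mean() + eps
    return np.clip(np.round(w / gamma), -1, 1), gamma

w = np.array([0.9, -0.05, -1.4, 0.3])
print(absmean_ternary(w))  # (array([ 1., -0., -1.,  0.]), 0.6625...)
```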
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
- Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization? [7.674972936853123]
Large Language Models (LLMs) have demonstrated impressive capabilities to solve a wide range of tasks without being explicitly fine-tuned on task-specific datasets.
We investigate whether smaller, compact LLMs are a good alternative to comparatively larger LLMs, to address the significant costs of using LLMs in the real world.
arXiv Detail & Related papers (2024-02-01T18:31:34Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer-based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
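A minimal sketch of one greedy load-balancing rule consistent with that goal: a joining device picks the pipeline stage currently bottlenecking total throughput. This illustrates the idea only; the paper's actual protocol is more involved:

```python
def assign_device(stage_throughput, device_throughput):
    """Send the new device to the slowest stage; return the stage index."""
    bottleneck = min(range(len(stage_throughput)),
                     key=stage_throughput.__getitem__)
    stage_throughput[bottleneck] += device_throughput
    return bottleneck

stages = [30.0, 10.0, 25.0]         # tokens/s served per pipeline stage
print(assign_device(stages, 15.0))  # -> 1, the current bottleneck
print(stages)                       # -> [30.0, 25.0, 25.0]
```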
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and variability in peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage.
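A simplified sketch of that fusion, using per-parameter hooks so each gradient is consumed by an SGD step and freed as soon as it is produced (requires PyTorch 2.1+; this illustrates the idea, not LOMO's exact implementation):

```python
import torch

def attach_fused_sgd_hooks(model: torch.nn.Module, lr: float = 1e-3):
    """Update each parameter the moment its gradient is accumulated,
    so full-model gradients never coexist in memory."""
    def hook(p):
        with torch.no_grad():
            p.add_(p.grad, alpha=-lr)  # fused SGD step
        p.grad = None                  # release gradient storage immediately
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
```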
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
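"Coupled structures" means that removing a channel in one layer forces matching removals in the layers wired to it. A toy example on a pair of linear layers, using weight norms as a stand-in for LLM-Pruner's importance criterion:

```python
import torch
import torch.nn as nn

def prune_coupled_linear(fc1, fc2, keep_ratio=0.75):
    """Drop fc1's lowest-norm output channels and fc2's matching inputs."""
    k = int(fc1.out_features * keep_ratio)
    scores = fc1.weight.norm(dim=1)          # one score per output channel
    keep = scores.topk(k).indices.sort().values
    new_fc1 = nn.Linear(fc1.in_features, k, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(k, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

a, b = prune_coupled_linear(nn.Linear(16, 32), nn.Linear(32, 16))
print(a, b)  # Linear(16 -> 24), Linear(24 -> 16)
```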
arXiv Detail & Related papers (2023-05-19T12:10:53Z)