A Hardware Evaluation Framework for Large Language Model Inference
- URL: http://arxiv.org/abs/2312.03134v1
- Date: Tue, 5 Dec 2023 21:01:33 GMT
- Title: A Hardware Evaluation Framework for Large Language Model Inference
- Authors: Hengrui Zhang, August Ning, Rohan Prabhakar, David Wentzlaff
- Abstract summary: This work introduces LLMCompass, a hardware evaluation framework for Large Language Model inference.
LLMCompass is fast, accurate, versatile, and able to describe and evaluate different hardware designs.
With the aid of LLMCompass, this work draws architectural implications and explores new cost-effective hardware designs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The past year has witnessed the increasing popularity of Large Language
Models (LLMs). Their unprecedented scale and associated high hardware cost have
impeded their broader adoption, calling for efficient hardware designs. With
the large hardware needed to simply run LLM inference, evaluating different
hardware designs becomes a new bottleneck.
This work introduces LLMCompass, a hardware evaluation framework for LLM
inference workloads. LLMCompass is fast, accurate, versatile, and able to
describe and evaluate different hardware designs. LLMCompass includes a mapper
to automatically find performance-optimal mapping and scheduling. It also
incorporates an area-based cost model to help architects reason about their
design choices. Compared to real-world hardware, LLMCompass' estimated latency
achieves an average 10.4% error rate across various operators with various
input sizes and an average 4.1% error rate for LLM inference. With LLMCompass,
simulating a 4-NVIDIA A100 GPU node running GPT-3 175B inference can be done
within 16 minutes on commodity hardware, including 26,400 rounds of the
mapper's parameter search.
With the aid of LLMCompass, this work draws architectural implications and
explores new cost-effective hardware designs. By reducing the compute
capability or replacing High Bandwidth Memory (HBM) with traditional DRAM,
these new designs can achieve as much as 3.41x improvement in performance/cost
compared to an NVIDIA A100, making them promising choices for democratizing
LLMs.
LLMCompass is planned to be fully open-source.
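As a rough illustration of the kind of analytical modeling such a framework performs, the hedged sketch below estimates a single matmul operator's latency with a simple roofline bound. The hardware numbers approximate an NVIDIA A100 (FP16 peak ~312 TFLOPS, ~2.0 TB/s HBM bandwidth); the function name and the model itself are illustrative stand-ins, far simpler than LLMCompass' actual mapper and area-based cost model.

```python
# Hypothetical roofline-style latency lower bound for one matmul operator.
# Hardware constants are approximate A100 figures, used only for illustration.

def matmul_latency_s(m, n, k, peak_flops=312e12, mem_bw=2.0e12, bytes_per_el=2):
    """Lower-bound latency for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * n * k                             # multiply-accumulate count
    traffic = bytes_per_el * (m * k + k * n + m * n)  # read A and B, write C
    compute_time = flops / peak_flops                 # compute-bound estimate
    memory_time = traffic / mem_bw                    # bandwidth-bound estimate
    return max(compute_time, memory_time)             # roofline: slower side wins

# A GPT-3-scale projection layer (batch*seq = 2048, d_model = 12288):
t = matmul_latency_s(2048, 12288, 12288)
print(f"{t * 1e3:.2f} ms")  # → 1.98 ms (compute-bound at this size)
```

A real evaluator must additionally model tiling, scheduling, interconnect, and multi-device parallelism, which is where the paper's mapper and its 26,400-round parameter search come in.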
Related papers
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
- Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers.
We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
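Early termination at intermediate layers can be sketched as a confidence-threshold loop; the toy layers, head, and threshold below are hypothetical stand-ins in the spirit of AdaInfer, not the paper's actual criterion.

```python
import math

# Hedged sketch of adaptive early-exit inference: stop running layers once an
# intermediate prediction is confident enough. All components here are toys.

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_forward(x, layers, head, threshold=0.9):
    """Run layers in order; exit as soon as top-1 probability >= threshold."""
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        probs = softmax(head(x))
        if max(probs) >= threshold:        # confident enough: skip the rest
            return probs, depth
    return probs, len(layers)              # fell through: used every layer

# Toy model: each "layer" sharpens the hidden state a little.
layers = [lambda h: [v * 1.5 for v in h]] * 8
head = lambda h: h                         # identity head over 3 "classes"
probs, used = adaptive_forward([0.1, 0.2, 0.7], layers, head)
print(f"exited after {used}/8 layers")     # → exited after 5/8 layers
```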
arXiv Detail & Related papers (2024-03-04T16:23:58Z)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
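The "1.58-bit" name comes from ternary weights: each weight takes one of three values {-1, 0, +1}, i.e. log2(3) ≈ 1.58 bits. The hedged sketch below follows the absmean quantization recipe described in the paper, but treat it as an illustration, not the reference implementation.

```python
# Hedged sketch of ternary ("1.58-bit") weight quantization in the spirit of
# BitNet b1.58: scale by the mean absolute weight, then round and clip so
# every quantized weight lands in {-1, 0, +1}.

def quantize_ternary(weights, eps=1e-8):
    """Absmean ternary quantization of a flat list of weights."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma     # gamma lets downstream matmuls rescale the result

q, gamma = quantize_ternary([0.9, -0.05, 0.4, -1.2])
print(q)                # every entry is in {-1, 0, 1}
```

With weights restricted to {-1, 0, +1}, the matmuls in inference reduce to additions and subtractions, which is why the paper argues for hardware specialized to 1-bit LLMs.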
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
- Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization? [7.674972936853123]
Large Language Models (LLMs) have demonstrated impressive capabilities to solve a wide range of tasks without being explicitly fine-tuned on task-specific datasets.
We investigate whether smaller, compact LLMs are a good alternative to comparatively larger LLMs, to address the significant costs associated with utilizing LLMs in the real world.
arXiv Detail & Related papers (2024-02-01T18:31:34Z)
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs [23.381331567339526]
Transformer-based Large Language Models (LLMs) have made a significant impact on various domains.
This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs.
FlightLLM beats the NVIDIA A100 GPU with 1.2x higher throughput using the latest Versal VHK158 FPGA.
arXiv Detail & Related papers (2024-01-08T13:00:53Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.