Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
- URL: http://arxiv.org/abs/2410.04466v2
- Date: Mon, 14 Oct 2024 10:53:29 GMT
- Title: Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
- Authors: Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Yu Wang, Guohao Dai,
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation.
The advancements in generative LLMs are closely intertwined with the development of hardware capabilities.
This paper comprehensively surveys efficient generative LLM inference on different hardware platforms.
- Score: 32.827076621809965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms, which can point to the future trends and potential developments of generative LLMs and hardware technology for edge-side scenarios.
Related papers
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators [1.1028525384019312]
Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications.
We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs.
Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks.
arXiv Detail & Related papers (2024-10-31T18:34:59Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference [2.2231908139555734]
We propose a general performance modeling methodology and workload analysis of distributed LLM training and inference.
We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA)
arXiv Detail & Related papers (2024-07-19T19:49:05Z) - Demystifying Platform Requirements for Diverse LLM Inference Use Cases [7.233203254714951]
We present an analytical tool, GenZ, to study the relationship between large language models inference performance and various platform design parameters.
We quantify the platform requirements to support SOTA LLMs models like LLaMA and GPT-4 under diverse serving settings.
Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications.
arXiv Detail & Related papers (2024-06-03T18:00:50Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Dissecting the Runtime Performance of the Training, Fine-tuning, and
Inference of Large Language Models [26.2566707495948]
Large Language Models (LLMs) have seen great advance in both academia and industry.
We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes.
Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
arXiv Detail & Related papers (2023-11-07T03:25:56Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - QIGen: Generating Efficient Kernels for Quantized Inference on Large
Language Models [22.055655390093722]
We present an automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
arXiv Detail & Related papers (2023-07-07T17:46:08Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.