LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
- URL: http://arxiv.org/abs/2411.00136v1
- Date: Thu, 31 Oct 2024 18:34:59 GMT
- Title: LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
- Authors: Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath
- Abstract summary: Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications.
We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs.
Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks.
- Score: 1.1028525384019312
- Abstract: Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs. We thoroughly analyze diverse hardware platforms, including GPUs from Nvidia and AMD and specialized AI accelerators from Intel Habana and SambaNova. Our evaluation includes several LLM inference frameworks and models from the LLaMA, Mistral, and Qwen families with 7B and 70B parameters. Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks. We provide an interactive dashboard to help identify configurations for optimal performance for a given hardware platform.
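The suite itself is not reproduced in this listing, but the core measurement such a benchmark performs, timing token generation to derive latency and throughput, can be sketched in a few lines. The sketch below is a minimal illustration assuming a Hugging Face transformers-style API; the model name is a placeholder, and the actual suite spans multiple inference frameworks and accelerators.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; the paper benchmarks models from the
# LLaMA, Mistral, and Qwen families at 7B and 70B parameters.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Briefly explain hardware acceleration for LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up generation so one-time setup cost does not skew the timing.
model.generate(**inputs, max_new_tokens=8)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"generated {new_tokens} tokens in {elapsed:.2f} s "
      f"({new_tokens / elapsed:.1f} tokens/s)")
```

A fuller harness would sweep batch sizes and input/output lengths, report time-to-first-token separately from steady-state decode throughput, and repeat runs to average out variance.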
Related papers
- Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective [32.827076621809965]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation.
The advancements in generative LLMs are closely intertwined with the development of hardware capabilities.
This paper comprehensively surveys efficient generative LLM inference on different hardware platforms.
arXiv Detail & Related papers (2024-10-06T12:42:04Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
arXiv Detail & Related papers (2024-06-12T22:58:12Z) - Demystifying Platform Requirements for Diverse LLM Inference Use Cases [7.233203254714951]
We present an analytical tool, GenZ, to study the relationship between large language model inference performance and various platform design parameters.
We quantify the platform requirements to support state-of-the-art LLMs such as LLaMA and GPT-4 under diverse serving settings.
Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications.
arXiv Detail & Related papers (2024-06-03T18:00:50Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems; a worked roofline calculation appears after this list.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - A Performance Evaluation of a Quantized Large Language Model on Various
Smartphones [0.0]
This paper explores the feasibility and performance of on-device large language model (LLM) inference on various Apple iPhone models.
Leveraging existing literature on running multi-billion parameter LLMs on resource-limited devices, our study examines the thermal effects and interaction speeds of a high-performing LLM.
We present real-world performance results, providing insights into on-device inference capabilities.
arXiv Detail & Related papers (2023-12-19T10:19:39Z) - Dissecting the Runtime Performance of the Training, Fine-tuning, and
Inference of Large Language Models [26.2566707495948]
Large Language Models (LLMs) have seen great advances in both academia and industry.
We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes.
Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
arXiv Detail & Related papers (2023-11-07T03:25:56Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - QIGen: Generating Efficient Kernels for Quantized Inference on Large
Language Models [22.055655390093722]
We present an automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution; a simplified sketch of the underlying quantized-kernel idea appears after this list.
arXiv Detail & Related papers (2023-07-07T17:46:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.