LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
- URL: http://arxiv.org/abs/2411.00136v1
- Date: Thu, 31 Oct 2024 18:34:59 GMT
- Title: LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
- Authors: Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath
- Abstract summary: Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications.
We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs.
Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks.
- Score: 1.1028525384019312
- Abstract: Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs. We thoroughly analyze diverse hardware platforms, including GPUs from Nvidia and AMD and specialized AI accelerators from Intel Habana and SambaNova. Our evaluation includes several LLM inference frameworks and models from the LLaMA, Mistral, and Qwen families with 7B and 70B parameters. Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks. We provide an interactive dashboard to help identify configurations for optimal performance for a given hardware platform.
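The suite itself is not reproduced in this listing, but the core measurement such a benchmark performs, timing token generation to derive latency and throughput, can be sketched in a few lines. The sketch below is a minimal illustration assuming a Hugging Face transformers-style API; the model name is a placeholder, and the actual suite spans multiple inference frameworks and accelerators.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; the paper benchmarks models from the
# LLaMA, Mistral, and Qwen families at 7B and 70B parameters.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Briefly explain hardware acceleration for LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up generation so one-time setup cost does not skew the timing.
model.generate(**inputs, max_new_tokens=8)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"generated {new_tokens} tokens in {elapsed:.2f} s "
      f"({new_tokens / elapsed:.1f} tokens/s)")
```

A fuller harness would sweep batch sizes and input/output lengths, report time-to-first-token separately from steady-state decode throughput, and repeat runs to average out variance.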
Related papers
- Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective [32.827076621809965]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation.
The advancements in generative LLMs are closely intertwined with the development of hardware capabilities.
This paper comprehensively surveys efficient generative LLM inference on different hardware platforms.
arXiv Detail & Related papers (2024-10-06T12:42:04Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
arXiv Detail & Related papers (2024-06-12T22:58:12Z) - Demystifying Platform Requirements for Diverse LLM Inference Use Cases [7.233203254714951]
We present an analytical tool, GenZ, to study the relationship between large language model inference performance and various platform design parameters.
We quantify the platform requirements to support state-of-the-art LLMs such as LLaMA and GPT-4 under diverse serving settings.
Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications.
arXiv Detail & Related papers (2024-06-03T18:00:50Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems; a worked roofline calculation appears after this list.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - A Performance Evaluation of a Quantized Large Language Model on Various
Smartphones [0.0]
This paper explores the feasibility and performance of on-device large language model (LLM) inference on various Apple iPhone models.
Leveraging existing literature on running multi-billion parameter LLMs on resource-limited devices, our study examines the thermal effects and interaction speeds of a high-performing LLM.
We present real-world performance results, providing insights into on-device inference capabilities.
arXiv Detail & Related papers (2023-12-19T10:19:39Z) - Dissecting the Runtime Performance of the Training, Fine-tuning, and
Inference of Large Language Models [26.2566707495948]
Large Language Models (LLMs) have seen great advances in both academia and industry.
We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes.
Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
arXiv Detail & Related papers (2023-11-07T03:25:56Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - QIGen: Generating Efficient Kernels for Quantized Inference on Large
Language Models [22.055655390093722]
We present an automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution; a simplified sketch of the underlying quantized-kernel idea appears after this list.
arXiv Detail & Related papers (2023-07-07T17:46:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.