From Words to Watts: Benchmarking the Energy Costs of Large Language
Model Inference
- URL: http://arxiv.org/abs/2310.03003v1
- Date: Wed, 4 Oct 2023 17:41:59 GMT
- Title: From Words to Watts: Benchmarking the Energy Costs of Large Language
Model Inference
- Authors: Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas,
Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay
Gadepally
- Abstract summary: Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art.
These models carry significant computational challenges, especially the compute and energy costs required for inference.
- Score: 19.439683873290623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have exploded in popularity due to their new
generative capabilities that go far beyond prior state-of-the-art. These
technologies are increasingly being leveraged in various domains such as law,
finance, and medicine. However, these models carry significant computational
challenges, especially the compute and energy costs required for inference.
The energy costs of inference receive far less attention than the energy costs of
training LLMs, despite how often these large models are called on to conduct
inference in practice (e.g., ChatGPT). As these state-of-the-art LLMs see
increasing usage and deployment in various domains, a better understanding of
their resource utilization is crucial for cost-savings, scaling performance,
efficient hardware usage, and optimal inference strategies.
In this paper, we describe experiments conducted to study the computational
and energy utilization of inference with LLMs. We benchmark and conduct a
preliminary analysis of the inference performance and inference energy costs of
different sizes of LLaMA, a recent state-of-the-art LLM from Meta AI, on two
generations of popular GPUs (NVIDIA V100 & A100) and two datasets (Alpaca and
GSM8K) chosen to reflect the diverse set of tasks/benchmarks for LLMs in
research and practice. We present the results of multi-node, multi-GPU
inference using model sharding across up to 32 GPUs. To our knowledge, ours is
one of the first studies of LLM inference performance from the perspective of
computational and energy resources at this scale.
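To make the measurement setup concrete, below is a minimal sketch of how per-GPU energy can be read around a single inference call; it assumes the pynvml bindings and a Volta-or-newer GPU (e.g., V100/A100), and is an illustration of the idea rather than the authors' benchmark harness.

```python
import time

import pynvml


def measure_inference_energy(generate_fn, gpu_index=0):
    """Run generate_fn() and return (result, joules, seconds) for one GPU."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # NVML reports cumulative energy in millijoules since the driver loaded.
    start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    t0 = time.time()
    result = generate_fn()  # e.g., lambda: model.generate(**inputs)
    seconds = time.time() - t0
    end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    pynvml.nvmlShutdown()
    return result, (end_mj - start_mj) / 1000.0, seconds
```

Dividing the returned joules by the number of generated tokens yields an energy-per-token figure; for sharded multi-GPU runs, the same reading would be summed over every device holding part of the model.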
Related papers
- A Comprehensive Study on Quantization Techniques for Large Language Models [0.0]
Large Language Models (LLMs) have been extensively researched and used in both academia and industry.
LLMs present significant challenges for deployment on resource-constrained IoT devices and embedded systems.
Quantization, a technique that reduces the precision of model values to a smaller set of discrete values, offers a promising solution.
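For illustration, a minimal sketch of the underlying idea, symmetric per-tensor int8 quantization (production schemes such as GPTQ or AWQ are considerably more sophisticated):

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Map float32 weights onto 255 int8 levels; returns (int8 weights, scale)."""
    scale = np.abs(w).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
print("bytes saved:", w.nbytes - q.nbytes)  # 4x smaller weight tensor
```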
arXiv Detail & Related papers (2024-10-30T04:55:26Z)
- EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.
We measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications.
Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
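As a reference point for the kind of classical exploration algorithm being distilled, here is a minimal UCB1 sketch on Bernoulli bandits (the setup is illustrative, not taken from the paper):

```python
import math
import random


def ucb1(arm_probs, horizon=10_000):
    """Play UCB1 on Bernoulli arms; returns total collected reward."""
    n = len(arm_probs)
    counts = [0] * n
    means = [0.0] * n
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1  # play each arm once to initialize
        else:
            arm = max(range(n), key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
        total += reward
    return total


print(ucb1([0.2, 0.5, 0.8]))  # approaches 0.8 * horizon as exploration pays off
```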
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
- Hardware Acceleration of LLMs: A comprehensive survey and comparison [0.0]
Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text.
We present a comprehensive survey of the several research efforts that have been presented for the acceleration of transformer networks for Large Language Models using hardware accelerators.
arXiv Detail & Related papers (2024-09-05T09:43:25Z)
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving [6.010159688581912]
We present throttLL'eM, a framework that reduces energy consumption while meeting Service-Level Objectives.
throttLL'eM features mechanisms that project future KV cache usage and batch size.
We show that the proposed ML model achieves R^2 scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average.
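A minimal sketch of the control idea, stepping locked GPU clocks down when the serving rate has slack against the SLO. The clock ladder and SLO values are hypothetical, the real throttLL'eM system adds ML-based KV-cache and batch-size projections, and setting locked clocks requires administrative privileges:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

SLO_ITERS_PER_S = 10.0                    # hypothetical service-level objective
CLOCK_STEPS_MHZ = [1410, 1200, 990, 780]  # hypothetical A100 clock ladder


def adjust_clocks(measured_iters_per_s, step):
    """Lower clocks when the SLO has slack; raise them when it is at risk."""
    if measured_iters_per_s > 1.1 * SLO_ITERS_PER_S and step < len(CLOCK_STEPS_MHZ) - 1:
        step += 1   # slack: trade frequency for energy
    elif measured_iters_per_s < SLO_ITERS_PER_S and step > 0:
        step -= 1   # SLO at risk: restore performance
    mhz = CLOCK_STEPS_MHZ[step]
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, mhz, mhz)
    return step
```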
arXiv Detail & Related papers (2024-08-05T09:07:06Z)
- Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy.
This paper addresses the challenge of reducing energy consumption in data centers running LLMs.
We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
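A minimal sketch of cost-based placement across a heterogeneous pool; the devices, energy figures, and greedy policy below are illustrative assumptions, not the paper's framework:

```python
# Hypothetical energy cost, in joules per thousand generated tokens.
ENERGY_COST = {"A100": 95.0, "V100": 160.0, "low-power-island": 300.0}
free_slots = {"A100": 2, "V100": 4, "low-power-island": 8}


def place(task_ktokens: float) -> str:
    """Place a task on the free pool with the lowest projected energy cost."""
    candidates = [(cost * task_ktokens, name)
                  for name, cost in ENERGY_COST.items() if free_slots[name] > 0]
    _, best = min(candidates)
    free_slots[best] -= 1
    return best


print(place(4.0))  # -> "A100" while A100 slots remain
```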
arXiv Detail & Related papers (2024-04-25T11:24:08Z)
- Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference [6.68507515624183]
Energy availability has come to the forefront as the biggest challenge for data center expansion to serve large language models.
We show that, depending on the inputs, the model, and the service-level agreements, an LLM inference provider has several knobs available for improving energy efficiency.
arXiv Detail & Related papers (2024-03-29T17:22:48Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
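A back-of-the-envelope example of the roofline reasoning (assumed numbers, not figures from the survey): single-batch fp16 decode of a 7B-parameter model on an A100.

```python
params = 7e9
flops_per_token = 2 * params   # roughly 2 FLOPs per parameter per token
bytes_per_token = 2 * params   # every fp16 weight is read once per token
intensity = flops_per_token / bytes_per_token   # ~1 FLOP/byte

peak_flops = 312e12            # A100 fp16 tensor-core peak
peak_bw = 2.0e12               # A100 HBM bandwidth, ~2 TB/s
ridge = peak_flops / peak_bw   # ~156 FLOP/byte

# intensity << ridge, so single-batch decode is memory-bound: throughput is
# capped near peak_bw / bytes_per_token, i.e. roughly 143 tokens/s.
print(intensity, ridge, peak_bw / bytes_per_token)
```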
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey [54.19942426544731]
The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains.
This paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs.
arXiv Detail & Related papers (2023-12-01T16:00:25Z)
- Power Hungry Processing: Watts Driving the Cost of AI Deployment? [74.19749699665216]
Generative, multi-purpose AI systems promise a unified approach to building machine learning (ML) models into technology.
This ambition of "generality" comes at a steep cost to the environment, given the amount of energy these systems require and the amount of carbon they emit.
We measure deployment cost as the amount of energy and carbon required to perform 1,000 inferences on a representative benchmark dataset using these models.
We conclude with a discussion of the current trend of deploying multi-purpose generative ML systems, and caution that their utility should be more intentionally weighed against increased costs in terms of energy and emissions.
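The accounting behind such a deployment-cost number reduces to a unit conversion; in this sketch the per-inference energy and grid intensity are illustrative assumptions:

```python
joules_per_inference = 3.0     # hypothetical measured value
grid_gco2_per_kwh = 400.0      # hypothetical grid carbon intensity

kwh = 1_000 * joules_per_inference / 3.6e6   # 1 kWh = 3.6e6 J
gco2 = kwh * grid_gco2_per_kwh
print(f"{kwh:.4f} kWh and {gco2:.2f} gCO2 per 1,000 inferences")
```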
arXiv Detail & Related papers (2023-11-28T15:09:36Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
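Model FLOP utilization (MFU) itself is a simple ratio; a sketch with illustrative numbers (none are from the paper):

```python
params = 1.1e9          # hypothetical ~1B-parameter edge model
tokens_per_s = 12.0     # hypothetical measured fine-tuning throughput
peak_flops = 10e12      # hypothetical edge accelerator fp16 peak

# Training needs roughly 6 FLOPs per parameter per token (forward + backward).
achieved_flops = 6 * params * tokens_per_s
mfu = achieved_flops / peak_flops
print(f"MFU = {mfu:.2%}")   # low MFU is the expected story on edge hardware
```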
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)