Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference
- URL: http://arxiv.org/abs/2403.20306v1
- Date: Fri, 29 Mar 2024 17:22:48 GMT
- Title: Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference
- Authors: Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas,
- Abstract summary: Energy availability has come to the forefront as the biggest challenge for data center expansion to serve large language models.
We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient.
- Score: 6.68507515624183
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models. In this paper, we present the trade-offs brought up by making energy efficiency the primary goal of LLM serving under performance SLOs. We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient. We characterize the impact of these knobs on the latency, throughput, as well as the energy. By exploring these trade-offs, we offer valuable insights into optimizing energy usage without compromising on performance, thereby paving the way for sustainable and cost-effective LLM deployment in data center environments.
Related papers
- Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy.
This paper addresses the challenge of reducing energy consumption in data centers running LLMs.
We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
arXiv Detail & Related papers (2024-04-25T11:24:08Z) - An LLM-Based Digital Twin for Optimizing Human-in-the Loop Systems [13.388869442538399]
We present a case study that employs large language models (LLMs) to mimic the behaviors and thermal preferences of various population groups in a shopping mall.
The aggregated thermal preferences are integrated into an agent-in-the-loop based reinforcement learning algorithm AitL-RL.
Our results show that LLMs are capable of simulating complex population movements within large open spaces.
arXiv Detail & Related papers (2024-03-25T14:32:28Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z) - EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit
Large Language Models [75.1814102438065]
EE-Tuning is a solution to training/tuning early-exit large language models (LLMs)
It augments any pre-trained (and possibly fine-tuned) standard LLM with additional early-exit layers that are tuned in a parameter-efficient manner.
Our implementation achieves outstanding training efficiency via extensive performance optimizations.
arXiv Detail & Related papers (2024-02-01T11:39:04Z) - Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs)
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
deploying these models has been challenging due to the astronomical amount of model parameters.
We propose an effective approach that can make the deployment of LLMs more efficiently.
arXiv Detail & Related papers (2023-11-01T13:08:50Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - From Words to Watts: Benchmarking the Energy Costs of Large Language
Model Inference [19.439683873290623]
Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art.
These models carry significant computational challenges, especially the compute and energy costs required for inference.
arXiv Detail & Related papers (2023-10-04T17:41:59Z) - POLCA: Power Oversubscription in LLM Cloud Providers [0.8299593158757622]
Large language models (LLMs) are becoming increasingly power intensive.
We show that there is a significant opportunity to oversubscribe power in LLM clusters.
We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters.
arXiv Detail & Related papers (2023-08-24T16:32:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.