DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- URL: http://arxiv.org/abs/2408.00741v1
- Date: Thu, 1 Aug 2024 17:40:45 GMT
- Title: DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse
- Abstract summary: We propose DynamoLLM, the first energy-management framework for generative large language models.
At the service level, DynamoLLM saves 53% of the energy and 38% of the operational carbon emissions, and reduces the cost to the customer by 61%.
- Score: 7.073435885680335
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume large amounts of energy and, consequently, produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and the fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize the energy and cost of LLM serving under the service's performance SLOs. We show that at the service level, DynamoLLM saves 53% of the energy and 38% of the operational carbon emissions, and reduces the cost to the customer by 61%, while meeting the latency SLOs.
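The abstract describes a search over cluster configurations (number of instances, model parallelism degree, GPU frequency) for the most energy-efficient operating point that still meets the latency SLO. The short Python sketch below illustrates that kind of selection over pre-profiled configurations; the `Config`/`Profile` structures and all numbers are illustrative assumptions, not DynamoLLM's actual interface or measurements.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Config:
    num_instances: int    # replicated model instances
    tensor_parallel: int  # GPUs per instance (model parallelism)
    gpu_freq_mhz: int     # locked GPU core frequency

@dataclass(frozen=True)
class Profile:
    p99_latency_ms: float  # measured tail latency at the current request rate
    power_w: float         # measured cluster power draw

def pick_config(profiles: dict[Config, Profile], slo_p99_ms: float) -> Optional[Config]:
    """Return the lowest-power configuration that still meets the latency SLO."""
    feasible = [(p.power_w, c) for c, p in profiles.items() if p.p99_latency_ms <= slo_p99_ms]
    return min(feasible, key=lambda t: t[0])[1] if feasible else None

# Hypothetical profiling data for one request-rate bucket.
profiles = {
    Config(4, 8, 1980): Profile(p99_latency_ms=180.0, power_w=5200.0),
    Config(4, 8, 1400): Profile(p99_latency_ms=240.0, power_w=3900.0),
    Config(2, 4, 1980): Profile(p99_latency_ms=310.0, power_w=2600.0),
}
print(pick_config(profiles, slo_p99_ms=250.0))  # -> Config(4, 8, 1400)
```

In the paper's setting, a choice like this would be revisited as the request rate and the input/output length mix shift, which is where the dynamic reconfiguration comes in.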
Related papers
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
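The DSMoE summary above mentions adaptive expert routing with sigmoid activations and straight-through estimators over partitioned FFN blocks. Below is a minimal PyTorch sketch of that routing pattern; the module layout and the 0.5 gating threshold are assumptions for illustration, not DSMoE's published implementation.

```python
import torch
import torch.nn as nn

class BlockGatedFFN(nn.Module):
    """FFN partitioned into blocks, gated by sigmoid scores with a straight-through estimator."""
    def __init__(self, d_model: int, d_ff: int, n_blocks: int):
        super().__init__()
        assert d_ff % n_blocks == 0
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff // n_blocks),
                          nn.GELU(),
                          nn.Linear(d_ff // n_blocks, d_model))
            for _ in range(n_blocks)
        ])
        self.router = nn.Linear(d_model, n_blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.router(x))    # soft per-block gate in [0, 1]
        hard = (probs > 0.5).float()             # binary routing decision
        gate = hard + probs - probs.detach()     # straight-through: hard forward, soft gradient
        out = torch.zeros_like(x)
        for i, block in enumerate(self.blocks):
            out = out + gate[..., i:i + 1] * block(x)
        return out

y = BlockGatedFFN(d_model=64, d_ff=256, n_blocks=4)(torch.randn(2, 10, 64))
```

For clarity, this sketch evaluates every block and masks its output; a real sparse implementation would skip blocks whose gate is zero to actually save computation.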
- Advancing Generative Artificial Intelligence and Large Language Models for Demand Side Management with Internet of Electric Vehicles [52.43886862287498]
This paper explores the integration of large language models (LLMs) into energy management.
We propose an innovative solution that enhances LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customizing optimization.
We present a case study to demonstrate the effectiveness of our proposed solution in charging scheduling and optimization for electric vehicles.
arXiv Detail & Related papers (2025-01-26T14:31:03Z)
- Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings [1.5749416770494706]
Large language models (LLMs) have shown significant improvements in many natural language processing (NLP) tasks.
LLMs are resource-intensive, requiring extensive computational resources both during training and inference.
As their adoption accelerates, the sustainability of LLMs has become a critical issue.
arXiv Detail & Related papers (2025-01-14T16:02:33Z)
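The study above sweeps GPU DVFS settings and measures the resulting energy/performance trade-off per task. A minimal sketch of such a sweep using the NVML bindings (`pynvml`) is shown below; the frequency list and the placeholder `run_inference_batch` workload are assumptions, clock locking requires administrator privileges, and the total-energy counter is only available on recent GPUs.

```python
import time
import pynvml

def run_inference_batch() -> None:
    """Placeholder for the LLM inference workload being measured."""
    time.sleep(1.0)

def sweep_frequencies(freqs_mhz: list[int]) -> list[tuple[int, float, float]]:
    """Lock the GPU core clock at each frequency, run the workload once,
    and record (frequency_mhz, elapsed_s, energy_j)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    results = []
    try:
        for f in freqs_mhz:
            pynvml.nvmlDeviceSetGpuLockedClocks(handle, f, f)        # requires admin rights
            e0 = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # millijoules since driver load
            t0 = time.time()
            run_inference_batch()
            elapsed = time.time() - t0
            e1 = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
            results.append((f, elapsed, (e1 - e0) / 1000.0))
    finally:
        pynvml.nvmlDeviceResetGpuLockedClocks(handle)
        pynvml.nvmlShutdown()
    return results

print(sweep_frequencies([1980, 1700, 1400, 1100]))
```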
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms [9.36320423249322]
The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in the cloud.
We propose TAPAS, a thermal-aware framework designed for LLM inference clusters in the cloud.
Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
arXiv Detail & Related papers (2025-01-05T16:51:17Z)
- Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System [75.25394449773052]
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving.
Yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods.
We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness.
arXiv Detail & Related papers (2024-10-10T17:00:06Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving [6.010159688581912]
We present throttLL'eM, a framework that reduces energy consumption while meeting Service-Level Objectives.
throttLL'eM features mechanisms that project future KV cache usage and batch size.
We show that the proposed ML model achieves $R^2$ scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average.
arXiv Detail & Related papers (2024-08-05T09:07:06Z)
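The throttLL'eM entry above describes scaling GPU frequency while meeting an SLO, guided by a model that predicts performance from projected batch size and KV-cache usage. The sketch below shows a simple controller in that spirit: it picks the lowest frequency whose predicted throughput still meets the target. The linear predictor and its coefficients are illustrative stand-ins, not the paper's learned ML model.

```python
def predicted_iters_per_s(freq_mhz: int, batch_size: int, kv_cache_util: float) -> float:
    """Illustrative stand-in for a learned performance model (coefficients are made up)."""
    return 0.02 * freq_mhz - 0.5 * batch_size - 10.0 * kv_cache_util

def choose_frequency(freqs_mhz: list[int], batch_size: int,
                     kv_cache_util: float, slo_iters_per_s: float) -> int:
    """Pick the lowest frequency predicted to still meet the throughput SLO;
    fall back to the highest frequency if none does."""
    for f in sorted(freqs_mhz):
        if predicted_iters_per_s(f, batch_size, kv_cache_util) >= slo_iters_per_s:
            return f
    return max(freqs_mhz)

print(choose_frequency([900, 1100, 1400, 1700, 1980],
                       batch_size=16, kv_cache_util=0.6, slo_iters_per_s=20.0))  # -> 1700
```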
- Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy.
This paper addresses the challenge of reducing energy consumption in data centers running LLMs.
We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
arXiv Detail & Related papers (2024-04-25T11:24:08Z)
- Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference [6.68507515624183]
Energy availability has come to the forefront as the biggest challenge for data center expansion to serve large language models.
We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient.
arXiv Detail & Related papers (2024-03-29T17:22:48Z)
- ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
However, deploying LVLMs is often problematic due to their massive computational/energy costs and carbon footprint.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
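The ECoFLaP entry above describes a two-stage, coarse-to-fine pruning scheme. The sketch below illustrates the general pattern: a coarse stage assigns each layer a sparsity ratio from a cheap global importance score, and a fine stage prunes weights within each layer by magnitude. Both the mean-|weight| importance heuristic and the 0.95 clipping bound are illustrative assumptions, not ECoFLaP's actual criteria.

```python
import torch

def coarse_layer_sparsities(layers: dict[str, torch.Tensor],
                            global_sparsity: float) -> dict[str, float]:
    """Coarse stage (sketch): layers with a smaller mean |weight| are treated as
    less important and assigned higher sparsity, centered on the global target."""
    score = {name: float(w.abs().mean()) for name, w in layers.items()}
    mean_score = sum(score.values()) / len(score)
    return {name: min(global_sparsity * mean_score / s, 0.95) for name, s in score.items()}

def fine_magnitude_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Fine stage: zero out the smallest-magnitude fraction of weights in one layer."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

# Hypothetical layer weights standing in for a vision-language model.
layers = {"vision.proj": torch.randn(256, 256), "text.ffn": torch.randn(256, 1024)}
ratios = coarse_layer_sparsities(layers, global_sparsity=0.5)
pruned = {name: fine_magnitude_prune(w, ratios[name]) for name, w in layers.items()}
```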
- Energy-Efficient Multi-Orchestrator Mobile Edge Learning [54.28419430315478]
Mobile Edge Learning (MEL) is a collaborative learning paradigm that features distributed training of Machine Learning (ML) models over edge devices.
In MEL, multiple learning tasks with different datasets may coexist.
We propose lightweight algorithms that can achieve near-optimal performance and facilitate the trade-offs between energy consumption, accuracy, and solution complexity.
arXiv Detail & Related papers (2021-09-02T07:37:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.