SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving
- URL: http://arxiv.org/abs/2408.05235v1
- Date: Mon, 5 Aug 2024 09:07:06 GMT
- Title: SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving
- Authors: Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris,
- Abstract summary: We present throttLL'eM, a framework that reduces energy consumption while meeting Service-Level Objectives (SLOs).
throttLL'eM features mechanisms that project future KV cache usage and batch size.
We show that the proposed ML model achieves $R^2$ scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average.
- Score: 6.010159688581912
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves $R^2$ scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8\% lower energy consumption and an energy efficiency improvement of at least $1.71\times$ under SLOs, when compared to NVIDIA's Triton server.
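The control idea in the abstract, projecting batch size and KV cache usage, predicting iterations per second with an ML model, and then picking the lowest GPU frequency that still satisfies the SLO, can be sketched as below. This is a hypothetical illustration rather than throttLL'eM's implementation: the perf_model regressor, the candidate clock list, and the SLO-to-throughput conversion are assumptions; only the NVML clock-locking call (pynvml.nvmlDeviceSetGpuLockedClocks) is a real API.

```python
# Hypothetical sketch of SLO-aware GPU frequency selection (not throttLL'eM itself).
# Assumes `perf_model` is a trained regressor mapping
# (clock_mhz, batch_size, kv_cache_usage) -> predicted iterations/sec,
# and that the NVIDIA driver allows locking graphics clocks via NVML.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Example candidate clocks (MHz); real values are GPU-specific and can be
# queried with nvmlDeviceGetSupportedGraphicsClocks.
CANDIDATE_CLOCKS_MHZ = [810, 1110, 1410, 1710, 1980]


def target_iterations_per_sec(pending_iterations: int, slo_seconds: float) -> float:
    """Throughput needed so the projected decode work finishes within the SLO."""
    return pending_iterations / slo_seconds


def pick_clock(perf_model, batch_size: int, kv_cache_usage: float, target_ips: float) -> int:
    """Return the lowest clock whose predicted throughput still meets the target."""
    for clock in CANDIDATE_CLOCKS_MHZ:          # ascending: try the lowest clocks first
        predicted_ips = perf_model.predict([[clock, batch_size, kv_cache_usage]])[0]
        if predicted_ips >= target_ips:
            return clock
    return CANDIDATE_CLOCKS_MHZ[-1]             # SLO at risk: fall back to the max clock


def apply_clock(clock_mhz: int) -> None:
    """Lock the GPU graphics clock to the chosen value (requires admin privileges)."""
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, clock_mhz, clock_mhz)
```

In such a loop, the chosen clock would be re-evaluated at every scheduling interval as the batch-size and KV-cache projections change.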
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the computational cost of the LLM (5.2-6.5x) and in its GPU memory footprint (2-6x) without compromising performance.
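As a rough illustration of the early-exit idea behind DeeR (stopping computation once an intermediate criterion is satisfied), the sketch below runs transformer blocks until a per-block exit head is confident enough. The block and head modules and the confidence threshold are invented for illustration and are not DeeR's actual exit criterion.

```python
# Minimal early-exit sketch (illustrative only; not DeeR's architecture or criterion).
import torch
import torch.nn as nn


class EarlyExitStack(nn.Module):
    """Runs transformer blocks sequentially and stops once an exit head is confident."""

    def __init__(self, blocks: nn.ModuleList, exit_heads: nn.ModuleList, threshold: float = 0.9):
        super().__init__()
        self.blocks = blocks
        self.exit_heads = exit_heads            # one lightweight classifier per block
        self.threshold = threshold              # assumed confidence threshold

    def forward(self, hidden: torch.Tensor) -> tuple[torch.Tensor, int]:
        for depth, (block, head) in enumerate(zip(self.blocks, self.exit_heads), start=1):
            hidden = block(hidden)
            # Pool over the sequence, classify, and check the top softmax probability.
            confidence = torch.softmax(head(hidden.mean(dim=1)), dim=-1).max().item()
            if confidence >= self.threshold:    # confident enough: skip remaining blocks
                return hidden, depth
        return hidden, len(self.blocks)
```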
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - $\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models [87.43596173378913]
We propose an innovative strategy for existing MLLMs called $\gamma$-MoD.
In $\gamma$-MoD, a novel metric, the rank of attention maps (ARank), is proposed to guide the deployment of MoDs in the MLLM.
Based on ARank, we propose two novel designs to maximize the computational sparsity of the MLLM.
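The summary above uses ARank, the rank of attention maps, as the signal for where to deploy MoD layers. The sketch below is a plausible reading of that idea, scoring each layer by the average numerical rank of its attention matrices and selecting the lowest-ranked layers for conversion; the paper's exact metric and selection rule may differ.

```python
# Illustrative sketch: score each layer by the average numerical rank of its
# attention maps, then mark the lowest-ranked layers as candidates for MoD.
import torch


def attention_rank(attn_maps: torch.Tensor) -> float:
    """attn_maps: (num_heads, seq_len, seq_len) attention matrices of one layer."""
    ranks = torch.linalg.matrix_rank(attn_maps.float())   # rank per head (batched)
    return ranks.float().mean().item()


def select_mod_layers(per_layer_attn: list[torch.Tensor], num_layers_to_convert: int) -> list[int]:
    """Return indices of the layers with the lowest average attention rank."""
    scores = [attention_rank(a) for a in per_layer_attn]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:num_layers_to_convert]
```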
arXiv Detail & Related papers (2024-10-17T17:59:53Z) - DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency [7.073435885680335]
We propose DynamoLLM, the first energy-management framework for generative large language models.
At the service level, DynamoLLM conserves 53% of energy, cuts operational carbon emissions by 38%, and reduces customer cost by 61%.
arXiv Detail & Related papers (2024-08-01T17:40:45Z) - Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy.
This paper addresses the challenge of reducing energy consumption in data centers running LLMs.
We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
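A cost-based scheduler of this kind can be illustrated with a toy placement routine: each accelerator reports an estimated latency and energy for a task, and the scheduler picks the cheapest feasible option. The cost terms and accelerator labels below are invented for illustration and are not the paper's model.

```python
# Toy cost-based placement sketch (illustrative; not the paper's scheduler).
from dataclasses import dataclass


@dataclass
class Estimate:
    accelerator: str     # e.g. "H100", "A100", "consumer-4090" (hypothetical labels)
    latency_s: float     # predicted task latency on this accelerator
    energy_j: float      # predicted task energy on this accelerator


def place_task(estimates: list[Estimate], deadline_s: float, price_per_joule: float) -> str:
    """Pick the cheapest accelerator that still meets the task deadline."""
    feasible = [e for e in estimates if e.latency_s <= deadline_s]
    if not feasible:
        # No option meets the deadline: fall back to the fastest accelerator.
        return min(estimates, key=lambda e: e.latency_s).accelerator
    return min(feasible, key=lambda e: e.energy_j * price_per_joule).accelerator
```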
arXiv Detail & Related papers (2024-04-25T11:24:08Z) - Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference [6.68507515624183]
Energy availability has come to the forefront as the biggest challenge for data center expansion to serve large language models.
We show that, depending on the inputs, the model, and the service-level agreements, several knobs are available to the LLM inference provider for improving energy efficiency.
arXiv Detail & Related papers (2024-03-29T17:22:48Z) - LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization [9.517540904818986]
This paper proposes adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters.
Experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) throughput improvement in inference.
arXiv Detail & Related papers (2024-03-02T08:40:07Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
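For context, the simplest form of 1-bit weight quantization keeps only the sign of each weight plus a per-row scale. BiLLM's actual scheme goes well beyond this (it treats salient weights separately with residual approximation and grouped splitting), so the sketch below is only the naive baseline that such methods start from.

```python
# Baseline 1-bit weight binarization (sign + per-row scale), for illustration only.
import torch


def binarize_rowwise(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """w: (out_features, in_features) weight matrix.
    Returns a sign matrix in {-1, +1} and a per-row scale alpha minimizing
    ||w - alpha * sign(w)||^2, i.e. alpha = mean(|w|) per row."""
    alpha = w.abs().mean(dim=1, keepdim=True)
    signs = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
    return signs, alpha


def dequantize(signs: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate weight matrix from the 1-bit representation."""
    return signs * alpha
```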
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference [19.439683873290623]
Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art.
These models carry significant computational challenges, especially the compute and energy costs required for inference.
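Benchmarks of this kind typically read GPU energy counters around the inference workload. The helper below is a generic measurement sketch (not the paper's harness) built on pynvml's nvmlDeviceGetTotalEnergyConsumption, which reports cumulative millijoules on supported GPUs.

```python
# Generic GPU energy measurement sketch (not the paper's benchmarking tooling).
import pynvml


def measure_energy_joules(run_inference, gpu_index: int = 0) -> float:
    """Run a callable and return the GPU energy it consumed, in joules."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    run_inference()                              # e.g. a batch of LLM generations
    end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    pynvml.nvmlShutdown()
    return (end_mj - start_mj) / 1000.0
```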
arXiv Detail & Related papers (2023-10-04T17:41:59Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
Deploying LVLMs is often problematic due to their massive computational and energy costs and carbon footprint.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
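A much-simplified version of a coarse-to-fine pruning pipeline is sketched below: a coarse stage assigns per-layer sparsity ratios from a layer-level score (here a crude magnitude-based stand-in for the paper's global importance estimates), and a fine stage applies unstructured magnitude pruning within each layer. It is illustrative only and not ECoFLaP's algorithm.

```python
# Simplified coarse-to-fine pruning sketch (illustration only, not ECoFLaP).
import torch


def coarse_ratios(weights: list[torch.Tensor], global_sparsity: float) -> list[float]:
    """Coarse stage: layers with smaller mean |w| get a larger sparsity ratio,
    normalized so the average ratio equals the requested global sparsity."""
    scores = torch.stack([w.abs().mean() for w in weights])
    inv = 1.0 / (scores + 1e-8)
    ratios = global_sparsity * inv * len(weights) / inv.sum()
    return ratios.clamp(max=0.95).tolist()


def fine_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Fine stage: zero out the smallest-magnitude entries to reach the sparsity."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= threshold, torch.zeros_like(w), w)
```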
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)