POLCA: Power Oversubscription in LLM Cloud Providers
- URL: http://arxiv.org/abs/2308.12908v1
- Date: Thu, 24 Aug 2023 16:32:34 GMT
- Title: POLCA: Power Oversubscription in LLM Cloud Providers
- Authors: Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh
Warrier, Nithish Mahalingam, Ricardo Bianchini
- Abstract summary: Large language models (LLMs) are becoming increasingly power intensive.
We show that there is a significant opportunity to oversubscribe power in LLM clusters.
We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters.
- Score: 0.8299593158757622
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent innovations in large language models (LLMs) and their myriad
use cases have rapidly driven up the compute capacity demand for datacenter
GPUs. Several cloud providers and other enterprises have made substantial plans
to grow their datacenters to support these new workloads. One of the key
bottleneck resources in datacenters is power, and given the increasing model
sizes of LLMs, these workloads are becoming increasingly power intensive. In
this paper, we show that there is a significant opportunity to oversubscribe
power in LLM clusters. Power oversubscription improves the power efficiency of
these datacenters, allows more servers to be deployed per datacenter, and
shortens deployment time, since building new datacenters is slow.
We extensively characterize the power consumption patterns of a variety of
LLMs and their configurations, and identify the differences between the
inference and training power consumption patterns. Based on our analysis of
these LLMs, we claim that the average and peak power utilization of LLM
clusters running inference should remain well below provisioned capacity. Our
deductions align with data from production LLM clusters, revealing that
inference workloads offer substantial headroom for power oversubscription.
However, the limited telemetry and controls that GPUs expose in a virtualized
environment make it challenging to build a reliable and robust power
oversubscription mechanism.
We propose POLCA, our framework for power oversubscription that is robust,
reliable, and readily deployable for GPU clusters. Using open-source models to
replicate the power patterns observed in production, we simulate POLCA and
demonstrate that we can deploy 30% more servers in the same GPU cluster for
inference, with minimal performance loss.
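To make the capacity arithmetic behind that 30% figure concrete, here is a
minimal sketch, assuming a hypothetical 1 MW cluster budget and illustrative
per-server power numbers (none of these values come from the paper):
provisioning servers against their observed inference peak instead of their
nameplate rating lets more of them fit under the same power budget.

```python
# Hypothetical illustration of power-oversubscription capacity math.
# All numbers are illustrative assumptions, not measurements from POLCA.

BUDGET_KW = 1000.0       # fixed cluster power budget (kW)
NAMEPLATE_KW = 6.5       # provisioned (nameplate) power per GPU server (kW)
OBSERVED_PEAK_KW = 5.0   # peak power actually observed during inference (kW)

def deployable_servers(per_server_kw: float, budget_kw: float = BUDGET_KW) -> int:
    """Number of servers that fit when each is provisioned at per_server_kw."""
    return int(budget_kw // per_server_kw)

conservative = deployable_servers(NAMEPLATE_KW)       # provision for nameplate
oversubscribed = deployable_servers(OBSERVED_PEAK_KW) # provision for observed peak

gain = 100 * (oversubscribed - conservative) / conservative
print(f"nameplate provisioning:  {conservative} servers")
print(f"peak-based provisioning: {oversubscribed} servers (+{gain:.0f}%)")
```

A real deployment additionally needs a safety mechanism, such as GPU power
capping or frequency scaling, for the rare intervals when aggregate demand
approaches the budget; providing such controls reliably in a virtualized GPU
environment is the gap a framework like POLCA must close.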
Related papers
- Decentralized Diffusion Models [53.89995588977048]
Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step.
This incurs a significant network burden that only centralized, monolithic clusters can support.
We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters.
arXiv Detail & Related papers (2025-01-09T18:59:56Z)
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms [9.36320423249322]
The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in the cloud.
We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud.
Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
arXiv Detail & Related papers (2025-01-05T16:51:17Z)
- Power- and Fragmentation-aware Online Scheduling for GPU Datacenters [9.29180785233729]
We focus on two objectives: minimizing GPU fragmentation and reducing power consumption.
To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations.
We show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.
arXiv Detail & Related papers (2024-12-23T11:27:17Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy.
This paper addresses the challenge of reducing energy consumption in data centers running LLMs.
We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
arXiv Detail & Related papers (2024-04-25T11:24:08Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- The MIT Supercloud Dataset [3.375826083518709]
We introduce the MIT Supercloud dataset which aims to foster innovative AI/ML approaches to the analysis of large scale HPC and datacenter/cloud operations.
We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data.
This paper describes the dataset, collection methodology, and data availability, and discusses potential challenge problems being developed using this data.
arXiv Detail & Related papers (2021-08-04T13:06:17Z)
- Power Modeling for Effective Datacenter Planning and Compute Management [53.41102502425513]
We discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable, and applicable to all hardware configurations and workloads.
We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percentage Error (MAPE) for more than 95% of over 2000 diverse Power Distribution Units (PDUs) using only 4 features; a minimal sketch of such a model follows this list.
arXiv Detail & Related papers (2021-03-22T21:22:51Z)
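As a rough illustration of what a simple, interpretable power model of this
kind can look like, here is a minimal sketch; the linear form, the four
features, and the synthetic data are all assumptions for illustration, not the
paper's actual models or features.

```python
# Hypothetical sketch of a simple, interpretable statistical power model.
# Features, coefficients, and data are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Four illustrative features per PDU sample: CPU utilization, GPU
# utilization, memory utilization, and fraction of servers powered on.
X = rng.uniform(0.0, 1.0, size=(n, 4))

# Synthetic "ground truth" power (kW): an idle floor plus per-feature terms.
true_w = np.array([120.0, 300.0, 40.0, 80.0])
power = 50.0 + X @ true_w + rng.normal(0.0, 10.0, size=n)

# Fit ordinary least squares with an intercept; the fitted coefficients
# read directly as kW contributed per unit of each feature.
A = np.hstack([np.ones((n, 1)), X])
coef, *_ = np.linalg.lstsq(A, power, rcond=None)
pred = A @ coef

# Mean Absolute Percentage Error, the accuracy metric quoted above.
mape = np.mean(np.abs((power - pred) / power)) * 100.0
print(f"MAPE: {mape:.2f}%")
```

Interpretability is the point of keeping the model this simple: each fitted
coefficient can be read as one feature's power contribution, which is what
makes such models practical for datacenter planning.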