PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable
Multi-GPU Inference Servers
- URL: http://arxiv.org/abs/2202.13481v1
- Date: Sun, 27 Feb 2022 23:30:55 GMT
- Title: PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable
Multi-GPU Inference Servers
- Authors: Yunseong Kim, Yujeong Choi, Minsoo Rhu
- Abstract summary: NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
- Score: 0.9854614058492648
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In cloud machine learning (ML) inference systems, providing low latency to
end-users is of utmost importance. However, maximizing server utilization and
system throughput is also crucial for ML service providers as it helps lower
the total cost of ownership. GPUs have often been criticized for ML
inference use cases because their massive compute and memory throughput is hard to
fully utilize under low-batch inference scenarios. To address this limitation,
NVIDIA's recently announced Ampere GPU architecture provides features to
"reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
This feature gives cloud ML service providers the ability to use the
reconfigurable GPU not only for large-batch training but also for small-batch
inference, with the potential to achieve high resource utilization. In this
paper, we study this emerging GPU architecture with reconfigurability to
develop a high-performance multi-GPU ML inference server. Our first contribution
is a partitioning algorithm for reconfigurable GPUs that systematically
determines a heterogeneous set of multi-granular GPU partitions best suited to
the inference server's deployment. Furthermore, we co-design an elastic
scheduling algorithm, tailored to our heterogeneously partitioned GPU server,
that effectively balances low latency and high GPU utilization.
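The abstract describes two cooperating pieces: a partitioning step that selects a heterogeneous mix of multi-granular GPU partitions, and an elastic scheduler that routes each query to a partition that can meet its latency target while keeping larger partitions free for heavier requests. The Python sketch below is only a minimal illustration of that idea under an assumed slice layout and a toy latency model; the `Partition`, `choose_partitions`, and `dispatch` names are hypothetical and are not the paper's actual PARIS/ELSA implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Partition:
    slices: int                                      # GPU slices in this partition (out of 7)
    queue: List[int] = field(default_factory=list)   # batch sizes of queued requests

    def predicted_latency_ms(self, batch: int) -> float:
        # Toy latency model: smaller partitions are slower per batch element,
        # and queued work adds serial delay. A real system would use profiled numbers.
        per_item = 2.0 * (7 / self.slices)
        backlog = sum(self.queue) * per_item
        return backlog + batch * per_item


def choose_partitions(total_slices: int = 7) -> List[Partition]:
    # Toy "partitioning": one larger partition for heavy requests plus smaller
    # partitions for low-batch queries (a stand-in for the paper's search step).
    layout = [4, 2, 1]
    assert sum(layout) <= total_slices
    return [Partition(s) for s in layout]


def dispatch(parts: List[Partition], batch: int, slo_ms: float) -> Optional[Partition]:
    # Elastic policy sketch: pick the smallest partition that still meets the SLO,
    # keeping larger partitions available for heavier requests.
    feasible = [p for p in parts if p.predicted_latency_ms(batch) <= slo_ms]
    target = min(feasible, key=lambda p: p.slices, default=None)
    if target is not None:
        target.queue.append(batch)
    return target


if __name__ == "__main__":
    parts = choose_partitions()
    for batch in [1, 1, 8, 32, 2]:
        p = dispatch(parts, batch, slo_ms=150.0)
        label = f"{p.slices}-slice partition" if p else "rejected (no partition meets SLO)"
        print(f"batch={batch:>2} -> {label}")
```

Picking the smallest feasible partition is just one plausible policy; the paper co-designs the scheduling decision with the chosen partition layout rather than fixing either in isolation.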
Related papers
- Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach [1.076745840431781]
We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs.
This results in a throughput improvement of up to 1.87x compared to time-sharing scheduling.
arXiv Detail & Related papers (2024-05-14T16:40:06Z) - NeRF-XL: Scaling NeRFs with Multiple GPUs [72.75214892939411]
We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs.
We show improvements in reconstruction quality with larger parameter counts and speed improvements with more GPUs.
We demonstrate the effectiveness of NeRF-XL on a wide variety of datasets, including the largest open-source dataset to date, MatrixCity, containing 258K images covering a 25 km² city area.
arXiv Detail & Related papers (2024-04-24T21:43:15Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models (a minimal MIG profile-selection sketch appears after this list).
arXiv Detail & Related papers (2022-09-13T14:13:06Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Multi-model Machine Learning Inference Serving with GPU Spatial
Partitioning [7.05946599544139]
High throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide bounded latency for each request to meet a consistent service-level objective (SLO).
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z) - GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose a Composable On-Package GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z) - GPU-Accelerated Primal Learning for Extremely Fast Large-Scale
Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON.
We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z) - MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical
Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)