Multi-model Machine Learning Inference Serving with GPU Spatial
Partitioning
- URL: http://arxiv.org/abs/2109.01611v1
- Date: Wed, 1 Sep 2021 04:46:46 GMT
- Title: Multi-model Machine Learning Inference Serving with GPU Spatial
Partitioning
- Authors: Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon,
Jaehyuk Huh
- Abstract summary: High throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide a bounded latency for each request to support consistent service-level objectives (SLOs).
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
- Score: 7.05946599544139
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As machine learning techniques are applied to a widening range of
applications, high throughput machine learning (ML) inference servers have
become critical for online service applications. Such ML inference servers pose
two challenges: first, they must provide a bounded latency for each request to
support a consistent service-level objective (SLO), and second, they can serve
multiple heterogeneous ML models in a system as certain tasks involve
invocation of multiple models and consolidating multiple models can improve
system utilization. To address the two requirements of ML inference servers,
this paper proposes a new ML inference scheduling framework for multi-model ML
inference servers. The paper first shows that with SLO constraints, current
GPUs are not fully utilized for ML inference tasks. To maximize the resource
efficiency of inference servers, a key mechanism proposed in this paper is to
exploit hardware support for spatial partitioning of GPU resources. With the
partitioning mechanism, a new abstraction layer of GPU resources is created
with configurable GPU resources. The scheduler assigns requests to virtual
GPUs, called gpu-lets, with the most effective amount of resources. The paper
also investigates a remedy for potential interference effects when two ML tasks
are running concurrently in a GPU. Our prototype implementation proves that
spatial partitioning enhances throughput by 102.6% on average while satisfying
SLOs.
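The gpu-let abstraction can be pictured with a minimal scheduling sketch. Everything below is an illustrative assumption, not the paper's actual algorithm: the class and function names, the per-model latency profile, and the inverse-proportional slowdown model for smaller partitions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GpuLet:
    """A virtual GPU slice: a fixed fraction of one physical GPU."""
    gpu_id: int
    fraction: float        # e.g. 0.5 = half of the GPU's compute resources
    queue_ms: float = 0.0  # work already queued on this slice, in ms

@dataclass
class Request:
    model: str
    slo_ms: float          # latency bound the response must meet

# Hypothetical per-model latency on a full GPU, in ms.
PROFILE = {"resnet50": 10.0, "bert": 25.0}

def estimate_latency(req: Request, glet: GpuLet) -> float:
    # Assume a smaller partition slows inference inversely to its size.
    return glet.queue_ms + PROFILE[req.model] / glet.fraction

def schedule(req: Request, gpulets: list) -> Optional[GpuLet]:
    """Assign the request to the smallest gpu-let that still meets its SLO."""
    feasible = [g for g in gpulets if estimate_latency(req, g) <= req.slo_ms]
    if not feasible:
        return None  # every placement would violate the SLO: reject or re-partition
    # Prefer the smallest feasible slice, keeping large slices free for heavy models.
    best = min(feasible, key=lambda g: g.fraction)
    best.queue_ms += PROFILE[req.model] / best.fraction
    return best

gpulets = [GpuLet(0, 0.5), GpuLet(0, 0.5), GpuLet(1, 1.0)]
light = schedule(Request("resnet50", slo_ms=30.0), gpulets)  # fits a half-GPU slice
heavy = schedule(Request("bert", slo_ms=30.0), gpulets)      # needs the full GPU
```

The point of the sketch is the "most effective amount of resources" decision: a request is steered to the smallest partition that still satisfies its SLO, which is what leaves headroom for consolidating more models per GPU.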
Related papers
- Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
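The device-assignment idea can be sketched as a simple greedy load balancer. This is an illustrative simplification, not the paper's fault-tolerant protocol; the function name, device names, and speed model are hypothetical.

```python
import heapq

def assign_layers(n_layers: int, device_speeds: dict) -> dict:
    """Greedily give each next layer to the currently least-loaded device,
    where load = assigned layers / device speed (faster device = more capacity)."""
    heap = [(0.0, name) for name in sorted(device_speeds)]
    heapq.heapify(heap)
    assignment = {name: [] for name in device_speeds}
    for layer in range(n_layers):
        load, name = heapq.heappop(heap)
        assignment[name].append(layer)
        heapq.heappush(heap, (load + 1.0 / device_speeds[name], name))
    return assignment

# A device twice as fast ends up serving roughly twice as many layers.
plan = assign_layers(8, {"fast_gpu": 2.0, "slow_gpu": 1.0})
```

With these speeds the faster device receives five of the eight layers and the slower one three, which balances per-device completion time rather than layer count.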
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs [32.01139974519813]
We present RedCoast, a tool crafted to automate distributed training and inference for large language models (LLMs)
We also propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions.
As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.
arXiv Detail & Related papers (2023-10-25T04:32:35Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and high variability in peer availability and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Fast Distributed Inference Serving for Large Language Models [12.682341873843882]
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT.
The interactive nature of these applications demands low job completion time (JCT) for model inference.
We present FastServe, a distributed inference serving system for LLMs.
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Walle: An End-to-End, General-Purpose, and Large-Scale Production System
for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML)
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z)
- PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable
Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
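As a rough illustration of the space such a scheduler searches, Ampere's MIG feature builds instances out of 7 compute slices, with fixed instance sizes of 1, 2, 3, 4, or 7 slices. The sketch below enumerates candidate partition layouts; it is a simplification that deliberately ignores MIG's real placement constraints, under which not every size combination is realizable.

```python
from itertools import combinations_with_replacement

SLICES = 7                         # an A100 exposes 7 compute slices
INSTANCE_SIZES = (1, 2, 3, 4, 7)   # valid MIG instance sizes, in slices

def partition_configs(total: int = SLICES) -> list:
    """Enumerate multisets of instance sizes that fit within the GPU."""
    configs = set()
    for k in range(1, total + 1):
        for combo in combinations_with_replacement(INSTANCE_SIZES, k):
            if sum(combo) <= total:
                # Store each layout sorted largest-first so duplicates collapse.
                configs.add(tuple(sorted(combo, reverse=True)))
    return sorted(configs, reverse=True)

configs = partition_configs()  # includes (7,), (4, 3), (1,)*7, ...
```

A reconfigurable inference server picks one of these layouts per GPU and reassesses it as the model mix and load change, which is exactly why reconfiguration cost and scheduling policy matter.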
arXiv Detail & Related papers (2022-02-27T23:30:55Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical
Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to a low-performing solver, we derive the new solver MPLP++ that outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.