Multi-model Machine Learning Inference Serving with GPU Spatial
Partitioning
- URL: http://arxiv.org/abs/2109.01611v1
- Date: Wed, 1 Sep 2021 04:46:46 GMT
- Title: Multi-model Machine Learning Inference Serving with GPU Spatial
Partitioning
- Authors: Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon,
Jaehyuk Huh
- Abstract summary: High throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide a bounded latency for each request to support consistent service-level objective (SLO)
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
- Score: 7.05946599544139
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As machine learning techniques are applied to a widening range of
applications, high throughput machine learning (ML) inference servers have
become critical for online service applications. Such ML inference servers pose
two challenges: first, they must provide a bounded latency for each request to
support consistent service-level objective (SLO), and second, they can serve
multiple heterogeneous ML models in a system as certain tasks involve
invocation of multiple models and consolidating multiple models can improve
system utilization. To address the two requirements of ML inference servers,
this paper proposes a new ML inference scheduling framework for multi-model ML
inference servers. The paper first shows that with SLO constraints, current
GPUs are not fully utilized for ML inference tasks. To maximize the resource
efficiency of inference servers, a key mechanism proposed in this paper is to
exploit hardware support for spatial partitioning of GPU resources. With the
partitioning mechanism, a new abstraction layer of GPU resources is created
with configurable GPU resources. The scheduler assigns requests to virtual
GPUs, called gpu-lets, with the most effective amount of resources. The paper
also investigates a remedy for potential interference effects when two ML tasks
are running concurrently in a GPU. Our prototype implementation proves that
spatial partitioning enhances throughput by 102.6% on average while satisfying
SLOs.
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving [15.01982917560918]
This paper proposes to harvest stranded GPU resources for offline LLM inference tasks.
We built ConServe, an LLM serving system that contains an execution engine that preempts running offline tasks.
Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks.
arXiv Detail & Related papers (2024-10-02T04:12:13Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs [32.01139974519813]
We present RedCoast, a tool crafted to automate distributed training and inference for large language models (LLMs)
We also propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions.
As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.
arXiv Detail & Related papers (2023-10-25T04:32:35Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs)
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
arXiv Detail & Related papers (2023-05-10T06:17:50Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable
Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions"
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
arXiv Detail & Related papers (2022-02-27T23:30:55Z) - MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical
Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.