SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based
Platforms
- URL: http://arxiv.org/abs/2301.12865v3
- Date: Fri, 1 Sep 2023 01:56:36 GMT
- Title: SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based
Platforms
- Authors: Yaodan Xu, Jingzhou Sun, Sheng Zhou, Zhisheng Niu
- Abstract summary: This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency.
The proposed solution has notable flexibility in balancing power consumption and latency.
- Score: 14.42787221783853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In up-to-date machine learning (ML) applications on cloud or edge computing
platforms, batching is an important technique for providing efficient and
economical services at scale. In particular, parallel computing resources on
the platforms, such as graphics processing units (GPUs), have higher
computational and energy efficiency with larger batch sizes. However, larger
batch sizes may also result in longer response time, and thus it requires a
judicious design. This paper aims to provide a dynamic batching policy that
strikes a balance between efficiency and latency. The GPU-based inference
service is modeled as a batch service queue with batch-size dependent
processing time. Then, the design of dynamic batching is a continuous-time
average-cost problem, and is formulated as a semi-Markov decision process
(SMDP) with the objective of minimizing the weighted sum of average response
time and average power consumption. The optimal policy is acquired by solving
an associated discrete-time Markov decision process (MDP) problem with finite
state approximation and "discretization". By introducing an abstract cost to
reflect the impact of "tail" states, the space complexity and the time
complexity of the procedure can decrease by 63.5% and 98%, respectively. Our
results show that the optimal policies potentially possess a control limit
structure. Numerical results also show that SMDP-based batching policies can
adapt to different traffic intensities and outperform other benchmark policies.
Furthermore, the proposed solution has notable flexibility in balancing power
consumption and latency.
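The formulation in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the linear latency and energy curves, the weights, the arrival rate, and all function names below are assumptions made for illustration, idle power is taken to be zero, and the queue is truncated by lumping overflow into the last state instead of using the paper's abstract cost for tail states. The code builds an SMDP for a batch-service queue with Poisson arrivals, converts it to a discrete-time MDP with the standard data transformation for average-cost SMDPs, and runs relative value iteration.

```python
import math
import numpy as np

def poisson_pmf(mean, kmax):
    """P(K = k) for k < kmax; all remaining tail mass is lumped into index kmax."""
    pmf = np.empty(kmax + 1)
    pmf[0] = math.exp(-mean)
    for k in range(1, kmax + 1):
        pmf[k] = pmf[k - 1] * mean / k
    pmf[kmax] = max(0.0, 1.0 - pmf[:kmax].sum())
    return pmf

def build_smdp(lam, S, B, tau, energy, w_delay, w_power):
    """State s = requests in the system at a decision epoch (truncated at S).
    Action 0 = wait for the next arrival; action b >= 1 = serve a batch of b."""
    P = np.zeros((S + 1, B + 1, S + 1))      # transition probabilities
    cost = np.full((S + 1, B + 1), np.inf)   # expected cost per sojourn (inf = infeasible)
    dur = np.ones((S + 1, B + 1))            # expected sojourn time
    for s in range(S + 1):
        if s < S:                            # waiting is disallowed in the last state
            P[s, 0, s + 1] = 1.0
            dur[s, 0] = 1.0 / lam
            cost[s, 0] = w_delay * (s / lam) / lam
        for b in range(1, min(s, B) + 1):
            t, left = tau(b), s - b
            P[s, b, left:] = poisson_pmf(lam * t, S - left)
            dur[s, b] = t
            held = s * t + lam * t * t / 2.0  # expected request-time accrued during service
            # Little's law: mean response time = mean number in system / lam
            cost[s, b] = w_delay * held / lam + w_power * energy(b)
    return P, cost, dur

def solve_average_cost(P, cost, dur, tol=1e-9, max_iter=200_000):
    """Data transformation to a discrete-time MDP, then relative value iteration."""
    n = cost.shape[0]
    eta = 0.5 * dur[np.isfinite(cost)].min()  # step size below every sojourn time
    c_t = cost / dur                          # cost rate of the transformed MDP
    P_t = eta * P / dur[:, :, None]
    for s in range(n):
        P_t[s, :, s] += 1.0 - eta / dur[s, :]  # self-loops keep each row summing to 1
    h = np.zeros(n)
    for _ in range(max_iter):
        Q = c_t + P_t @ h
        h_new = Q.min(axis=1)
        g = h_new[0]                          # gain estimate, reference state 0
        h_new = h_new - g
        converged = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if converged:
            break
    return g, Q.argmin(axis=1)

# Hypothetical parameters: times in ms, energy in arbitrary units.
lam, S, B = 1.5, 40, 8
P, cost, dur = build_smdp(lam, S, B,
                          tau=lambda b: 1.0 + 0.25 * b,
                          energy=lambda b: 2.0 + 0.5 * b,
                          w_delay=1.0, w_power=1.0)
g, policy = solve_average_cost(P, cost, dur)
print("average cost per unit time:", round(float(g), 4))
print("batch size chosen in each state:", policy.tolist())
```

The returned `policy` lists the batch size chosen in each queue state, with 0 meaning "wait". If the computed policy has the control limit structure mentioned in the abstract, it waits below some queue length and serves at or above it. Raising `w_power` relative to `w_delay` shifts the trade-off toward energy efficiency at the expense of response time.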
Related papers
- When to Sense and Control? A Time-adaptive Approach for Continuous-Time RL [37.58940726230092]
Reinforcement learning (RL) excels in optimizing policies for discrete-time Markov decision processes (MDP).
We formalize an RL framework, Time-adaptive Control & Sensing (TaCoS), that tackles this challenge.
We demonstrate that state-of-the-art RL algorithms trained on TaCoS drastically reduce the number of interactions compared with their discrete-time counterparts.
arXiv Detail & Related papers (2024-06-03T09:57:18Z)
- Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing [11.403989519949173]
This work focuses on the timeliness of computation-intensive updates, measured by Age-of-Information (AoI).
We study how to jointly optimize the task updating and offloading policies for AoI with fractional form.
Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.
arXiv Detail & Related papers (2023-12-16T11:13:40Z)
- Age-Based Scheduling for Mobile Edge Computing: A Deep Reinforcement Learning Approach [58.911515417156174]
We propose a new definition of Age of Information (AoI) and, based on the redefined AoI, we formulate an online AoI problem for MEC systems.
We introduce Post-Decision States (PDSs) to exploit the partial knowledge of the system's dynamics.
We also combine PDSs with deep RL to further improve the algorithm's applicability, scalability, and robustness.
arXiv Detail & Related papers (2023-12-01T01:30:49Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints.
Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z)
- Differentially Private Deep Q-Learning for Pattern Privacy Preservation in MEC Offloading [76.0572817182483]
Attackers may eavesdrop on the offloading decisions to infer the edge server's (ES's) queue information and users' usage patterns.
We propose an offloading strategy which jointly minimizes the latency, ES's energy consumption, and task dropping rate, while preserving pattern privacy (PP).
We develop a Differential Privacy Deep Q-learning based Offloading (DP-DQO) algorithm to solve this problem while addressing the PP issue by injecting noise into the generated offloading decisions.
arXiv Detail & Related papers (2023-02-09T12:50:18Z)
- Faster Approximate Dynamic Programming by Freezing Slow States [5.6928413790238865]
We consider infinite horizon Markov decision processes (MDPs) with fast-slow structure.
Such structure is common in real-world problems where sequential decisions need to be made at high frequencies.
We propose an approximate dynamic programming framework based on the idea of "freezing" the slow states.
arXiv Detail & Related papers (2023-01-03T01:35:24Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- MCDS: AI Augmented Workflow Scheduling in Mobile Edge Cloud Computing Systems [12.215537834860699]
Recently proposed scheduling methods leverage the low response times of edge computing platforms to optimize application Quality of Service (QoS).
We propose MCDS: Monte Carlo Learning using Deep Surrogate Models to efficiently schedule workflow applications in mobile edge-cloud computing systems.
arXiv Detail & Related papers (2021-12-14T10:00:01Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)