Towards VM Rescheduling Optimization Through Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2505.17359v1
- Date: Fri, 23 May 2025 00:30:53 GMT
- Title: Towards VM Rescheduling Optimization Through Deep Reinforcement Learning
- Authors: Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du
- Abstract summary: We develop a reinforcement learning system for VM rescheduling, VMR2L, which incorporates a set of customized techniques. Our results show that VMR2L can achieve a performance comparable to the optimal solution but with a running time of seconds.
- Score: 9.4293010682986
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern industry-scale data centers need to manage a large number of virtual machines (VMs). Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs, a practice commonly referred to as VM rescheduling. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, due to dynamic VM state changes during this period. This causes existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VMR2L, which incorporates a set of customized techniques, such as a two-stage framework that accommodates diverse constraints and workload conditions, a feature extraction module that captures relational information specific to rescheduling, as well as a risk-seeking evaluation enabling users to optimize the trade-off between latency and accuracy. We conduct extensive experiments with data from an industry-scale data center. Our results show that VMR2L can achieve a performance comparable to the optimal solution but with a running time of seconds. Code and datasets are open-sourced: https://github.com/zhykoties/VMR2L_eurosys, https://drive.google.com/drive/folders/1PfRo1cVwuhH30XhsE2Np3xqJn2GpX5qy.
Related papers
- Enhancing Robustness and Efficiency of Least Square Twin SVM via Granular Computing [0.2999888908665658]
In the domain of machine learning, least square twin support vector machine (LSTSVM) stands out as one of the state-of-the-art models. LSTSVM suffers from sensitivity to noise and inversions, overlooking the principle and instability in resampling. We propose the robust granular ball LSTSVM (GBLSTSVM), which is trained using granular balls instead of original data points.
arXiv Detail & Related papers (2024-10-22T18:13:01Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Recipe for Fast Large-scale SVM Training: Polishing, Parallelism, and more RAM! [0.0]
Support vector machines (SVMs) are a standard method in the machine learning toolbox.
Non-linear kernel SVMs often deliver highly accurate predictors, however, at the cost of long training times.
In this work, we combine both approaches to design an extremely fast dual SVM solver.
arXiv Detail & Related papers (2022-07-03T11:51:41Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - VMAgent: Scheduling Simulator for Reinforcement Learning [44.026076801936874]
A novel simulator called VMAgent is introduced to help RL researchers better explore new methods.
VMAgent is inspired by practical virtual machine (VM) scheduling tasks.
From the VM scheduling perspective, VMAgent also helps to explore better learning-based scheduling solutions.
arXiv Detail & Related papers (2021-12-09T09:18:38Z) - Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices.
We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time.
Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z) - Combination of Convolutional Neural Network and Gated Recurrent Unit for Energy Aware Resource Allocation [0.0]
Cloud computing service models have experienced rapid growth, and inefficient resource usage is one of the greatest causes of high energy consumption in cloud data centers.
Resource allocation in cloud data centers aiming to reduce energy consumption has been conducted using live migration of Virtual Machines (VMs) and their consolidation into a small number of Physical Machines (PMs).
To solve this issue, VMs can be classified according to the pattern of user requests into latency-sensitive or latency-insensitive classes, and thereafter suitable VMs can be selected for migration.
arXiv Detail & Related papers (2021-06-23T05:57:51Z) - AML-SVM: Adaptive Multilevel Learning with Support Vector Machines [0.0]
This paper proposes an adaptive multilevel learning framework for the nonlinear SVM.
It improves the classification quality across the refinement process, and leverages multi-threaded parallel processing for better performance.
arXiv Detail & Related papers (2020-11-05T00:17:02Z) - On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)