Related papers: Cost-Aware Dynamic Cloud Workflow Scheduling using Self-Attention and Evolutionary Reinforcement Learning

Cost-Aware Dynamic Cloud Workflow Scheduling using Self-Attention and Evolutionary Reinforcement Learning

URL: http://arxiv.org/abs/2409.18444v1
Date: Fri, 27 Sep 2024 04:45:06 GMT
Title: Cost-Aware Dynamic Cloud Workflow Scheduling using Self-Attention and Evolutionary Reinforcement Learning
Authors: Ya Shen, Gang Chen, Hui Ma, Mengjie Zhang,
Abstract summary: This paper proposes a novel self-attention policy network for cloud workflow scheduling. The trained SPN-CWS can effectively process all candidate instances simultaneously to identify the most suitable VM instance to execute every workflow task. Our method can noticeably outperform several state-of-the-art algorithms on multiple benchmark CDMWS problems.
Score: 7.653021685451039
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Cost-aware Dynamic Multi-Workflow Scheduling (CDMWS) in the cloud is a kind of cloud workflow management problem, which aims to assign virtual machine (VM) instances to execute tasks in workflows so as to minimize the total costs, including both the penalties for violating Service Level Agreement (SLA) and the VM rental fees. Powered by deep neural networks, Reinforcement Learning (RL) methods can construct effective scheduling policies for solving CDMWS problems. Traditional policy networks in RL often use basic feedforward architectures to separately determine the suitability of assigning any VM instances, without considering all VMs simultaneously to learn their global information. This paper proposes a novel self-attention policy network for cloud workflow scheduling (SPN-CWS) that captures global information from all VMs. We also develop an Evolution Strategy-based RL (ERL) system to train SPN-CWS reliably and effectively. The trained SPN-CWS can effectively process all candidate VM instances simultaneously to identify the most suitable VM instance to execute every workflow task. Comprehensive experiments show that our method can noticeably outperform several state-of-the-art algorithms on multiple benchmark CDMWS problems.

Related papers

Symmetry-Preserving Architecture for Multi-NUMA Environments (SPANE): A Deep Reinforcement Learning Approach for Dynamic VM Scheduling [28.72083501050024]
We introduce the Dynamic VM Allocation problem in Multi-NUMA PM (DVAMP) We propose SPANE, a novel deep reinforcement learning approach that exploits the problem's inherent symmetries. Experiments conducted on the Huawei-East-1 dataset demonstrate that SPANE outperforms existing baselines, reducing average VM wait time by 45%.
arXiv Detail & Related papers (2025-04-21T08:09:40Z)
Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review [10.015735252600793]
Deep Reinforcement Learning (DRL) has emerged as a promising solution to these challenges. DRL enables systems to learn and adapt policies based on continuous observations of the environment. This survey provides a comprehensive review of DRL-based algorithms for job scheduling and resource management in cloud computing.
arXiv Detail & Related papers (2025-01-02T02:08:00Z)
Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning [61.294110816231886]
We introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP) SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. Demos and codes can be found in https://forrest-110.io/sparse_diffusion_policy/.
arXiv Detail & Related papers (2024-07-01T17:59:56Z)
Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem. We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs) MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
CILP: Co-simulation based Imitation Learner for Dynamic Resource Provisioning in Cloud Computing Environments [13.864161788250856]
Key challenge for latency-critical tasks is to predict future workload demands to provision proactively. Existing AI-based solutions tend to not holistically consider all crucial aspects such as provision overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system. We propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization.
arXiv Detail & Related papers (2023-02-11T09:15:34Z)
SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training [4.015081523508339]
We propose SMLT, an automated, scalable, and adaptive serverless framework to enable efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling for ML tasks during training. Our experimental evaluation with large, sophisticated modern ML models demonstrate that SMLT outperforms the state-of-the-art VM based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X)
arXiv Detail & Related papers (2022-05-04T02:11:26Z)
Controllable Dynamic Multi-Task Architectures [92.74372912009127]
We propose a controllable multi-task network that dynamically adjusts its architecture and weights to match the desired task preference as well as the resource constraints. We propose a disentangled training of two hypernetworks, by exploiting task affinity and a novel branching regularized loss, to take input preferences and accordingly predict tree-structured models with adapted weights.
arXiv Detail & Related papers (2022-03-28T17:56:40Z)
Semantic of Cloud Computing services for Time Series workflows [0.0]
Time series (TS) are present in many fields of knowledge, research, and engineering. The processing and analysis of TS are essential in order to extract knowledge from the data and to tackle forecasting or predictive maintenance tasks. The modeling of TS is a challenging task, requiring high statistical expertise as well as outstanding knowledge about the application of Data Mining(DM) and Machine Learning (ML) methods.
arXiv Detail & Related papers (2022-02-01T17:57:40Z)
VMAgent: Scheduling Simulator for Reinforcement Learning [44.026076801936874]
A novel simulator called VMAgent is introduced to help RL researchers better explore new methods. VMAgent is inspired by practical virtual machine (VM) scheduling tasks. From the VM scheduling perspective, VMAgent also helps to explore better learning-based scheduling solutions.
arXiv Detail & Related papers (2021-12-09T09:18:38Z)
Reinforcement Learning on Computational Resource Allocation of Cloud-based Wireless Networks [22.06811314358283]
Wireless networks used for Internet of Things (IoT) are expected to largely involve cloud-based computing and processing. In a cloud environment, dynamic computational resource allocation is essential to save energy while maintaining the performance of the processes. This paper models this dynamic computational resource allocation problem into a Markov Decision Process (MDP) and designs a model-based reinforcement-learning agent to optimise the dynamic resource allocation of the CPU usage. The results show that our agent rapidly converges to the optimal policy, stably performs in different settings, outperforms or at least equally performs compared to a baseline algorithm in energy savings for different scenarios.
arXiv Detail & Related papers (2020-10-10T15:16:26Z)
Federated Learning for Task and Resource Allocation in Wireless High Altitude Balloon Networks [160.96150373385768]
The problem of minimizing energy and time consumption for task computation and transmission is studied in a mobile edge computing (MEC)-enabled balloon network. A support vector machine (SVM)-based federated learning (FL) algorithm is proposed to determine the user association proactively. The proposed SVM-based FL method enables each HAB to cooperatively build an SVM model that can determine all user associations.
arXiv Detail & Related papers (2020-03-19T14:18:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.