Machine Learning (ML)-Centric Resource Management in Cloud Computing: A
Review and Future Directions
- URL: http://arxiv.org/abs/2105.05079v1
- Date: Sun, 9 May 2021 08:03:58 GMT
- Title: Machine Learning (ML)-Centric Resource Management in Cloud Computing: A
Review and Future Directions
- Authors: Tahseen Khan, Wenhong Tian, Rajkumar Buyya
- Abstract summary: Infrastructure as a Service (IaaS) is one of the most important and rapidly growing fields.
One of the most important aspects of cloud computing for IaaS is resource management.
Machine learning is being used to handle a variety of resource management tasks.
- Score: 22.779373079539713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud computing has rapidly emerged as a model for delivering Internet-based
utility computing services. In cloud computing, Infrastructure as a Service
(IaaS) is one of the most important and rapidly growing fields. Cloud providers
provide users/machines resources such as virtual machines, raw (block) storage,
firewalls, load balancers, and network devices in this service model. One of
the most important aspects of cloud computing for IaaS is resource management.
Scalability, quality of service, optimum utility, reduced overheads, increased
throughput, reduced latency, specialised environment, cost effectiveness, and a
streamlined interface are some of the advantages of resource management for
IaaS in cloud computing. Traditionally, resource management has been done
through static policies, which impose certain limitations in various dynamic
scenarios, prompting cloud service providers to adopt data-driven,
machine-learning-based approaches. Machine learning is being used to handle a
variety of resource management tasks, including workload estimation, task
scheduling, VM consolidation, resource optimization, and energy optimization,
among others. This paper provides a detailed review of challenges in ML-based
resource management in current research, along with current approaches to
resolving these challenges and their advantages and limitations. Finally,
we propose potential future research directions based on identified challenges
and limitations in current research.
Related papers
- Application of Machine Learning Optimization in Cloud Computing Resource
Scheduling and Management [18.462300407761873]
The scale of cloud computing in China has reached 209.1 billion yuan.
This paper proposes an innovative approach to solve complex problems in cloud computing resource scheduling and management.
arXiv Detail & Related papers (2024-02-27T05:14:27Z)
- Computing in the Era of Large Generative Models: From Cloud-Native to
AI-Native [46.7766555589807]
We describe an AI-native computing paradigm that harnesses the power of both cloud-native technologies and advanced machine learning inference.
These joint efforts aim to optimize costs-of-goods-sold (COGS) and improve resource accessibility.
arXiv Detail & Related papers (2024-01-17T20:34:11Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- Measuring the Carbon Intensity of AI in Cloud Instances [91.28501520271972]
We provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions.
We evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform.
arXiv Detail & Related papers (2022-06-10T17:04:04Z)
- A Survey on Machine Learning for Geo-Distributed Cloud Data Center
Management [4.226118870861363]
Cloud service providers have been distributing data centers globally to reduce operating costs and improve quality of service.
Such large scale and complex orchestration of software workload and hardware resources remains a difficult problem to solve efficiently.
We review the state-of-the-art Machine Learning techniques for the cloud data center management problem.
arXiv Detail & Related papers (2022-05-17T03:14:54Z)
- HUNTER: AI based Holistic Resource Management for Sustainable Cloud
Computing [26.48962351761643]
We propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER.
The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem.
Experiments on simulated and physical cloud environments show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.
arXiv Detail & Related papers (2021-10-11T18:11:26Z)
- The MIT Supercloud Dataset [3.375826083518709]
We introduce the MIT Supercloud dataset which aims to foster innovative AI/ML approaches to the analysis of large scale HPC and datacenter/cloud operations.
We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data.
This paper discusses the details of the dataset, collection methodology, and data availability, as well as potential challenge problems being developed using this data.
arXiv Detail & Related papers (2021-08-04T13:06:17Z)
- Power Modeling for Effective Datacenter Planning and Compute Management [53.41102502425513]
We discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable and applicable to all hardware configurations and workloads.
We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percent Error (MAPE) for more than 95% of diverse Power Distribution Units (more than 2000) using only 4 features.
arXiv Detail & Related papers (2021-03-22T21:22:51Z)
- Artificial Intelligence (AI)-Centric Management of Resources in Modern
Distributed Computing Systems [22.550075095184514]
Cloud Data Centres (DCS) are large scale, complex, heterogeneous, and distributed across multiple networks and geographical boundaries.
The Internet of Things (IoT)-driven applications are producing a huge amount of data that requires real-time processing and fast response.
Existing Resource Management Systems (RMS) rely on either static or heuristic solutions that are inadequate for such composite and dynamic systems.
arXiv Detail & Related papers (2020-06-09T06:54:07Z)
- A Privacy-Preserving Distributed Architecture for
Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service.
It is able to preserve users' sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G
Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address these open problems, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.