A Survey on Machine Learning for Geo-Distributed Cloud Data Center
Management
- URL: http://arxiv.org/abs/2205.08072v1
- Date: Tue, 17 May 2022 03:14:54 GMT
- Title: A Survey on Machine Learning for Geo-Distributed Cloud Data Center
Management
- Authors: Ninad Hogade, Sudeep Pasricha
- Abstract summary: Cloud service providers have been distributing data centers globally to reduce operating costs and improve quality of service.
Such large scale and complex orchestration of software workload and hardware resources remains a difficult problem to solve efficiently.
We review the state-of-the-art Machine Learning techniques for the cloud data center management problem.
- Score: 4.226118870861363
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Cloud workloads today are typically managed in a distributed environment and
processed across geographically distributed data centers. Cloud service
providers have been distributing data centers globally to reduce operating
costs while also improving quality of service by using intelligent workload and
resource management strategies. Such large scale and complex orchestration of
software workload and hardware resources remains a difficult problem to solve
efficiently. Researchers and practitioners have been trying to address this
problem by proposing a variety of cloud management techniques. Mathematical
optimization techniques have historically been used to address cloud management
issues. But these techniques are difficult to scale to geo-distributed problem
sizes and have limited applicability in dynamic heterogeneous system
environments, forcing cloud service providers to explore intelligent
data-driven and Machine Learning (ML) based alternatives. The characterization,
prediction, control, and optimization of complex, heterogeneous, and
ever-changing distributed cloud resources and workloads employing ML
methodologies have received much attention in recent years. In this article, we
review the state-of-the-art ML techniques for the cloud data center management
problem. We examine the challenges and the issues in current research focused
on ML for cloud management and explore strategies for addressing these issues.
We also discuss advantages and disadvantages of ML techniques presented in the
recent literature and make recommendations for future research directions.
Related papers
- Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances using generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
(2024-01-11T15:02:15Z)
- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey [54.19942426544731]
The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains.
This paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs.
(2023-12-01T16:00:25Z)
- Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for
Enhanced Deep Learning Performance and Efficiency [0.0]
In recent years, the integration of artificial intelligence (AI) and cloud computing has emerged as a promising avenue for addressing the growing computational demands of AI applications.
This paper presents a comprehensive study of scalable, distributed AI frameworks leveraging cloud computing for enhanced deep learning performance and efficiency.
(2023-04-26T15:38:00Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
(2023-04-17T02:12:30Z)
- Outsourcing Training without Uploading Data via Efficient Collaborative
Open-Source Sampling [49.87637449243698]
Traditional outsourcing requires uploading device data to the cloud server.
We propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources.
We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
(2022-10-23T00:12:18Z)
- Measuring the Carbon Intensity of AI in Cloud Instances [91.28501520271972]
We provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions.
We evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform.
(2022-06-10T17:04:04Z)
- The MIT Supercloud Dataset [3.375826083518709]
We introduce the MIT Supercloud dataset which aims to foster innovative AI/ML approaches to the analysis of large scale HPC and datacenter/cloud operations.
We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data.
This paper describes the dataset, its collection methodology, and its availability, and outlines potential challenge problems being developed using this data.
(2021-08-04T13:06:17Z)
- Machine Learning (ML)-Centric Resource Management in Cloud Computing: A
Review and Future Directions [22.779373079539713]
Infrastructure as a Service (IaaS) is one of the most important and rapidly growing fields.
One of the most important aspects of cloud computing for IaaS is resource management.
Machine learning is being used to handle a variety of resource management tasks.
(2021-05-09T08:03:58Z)
- Machine learning for cloud resources management -- An overview [0.0]
This study explores the most important cloud resources management issues that have been combined with Machine Learning.
A large body of research is used to make sensible comparisons between the ML techniques applied in the different areas of cloud resource management.
We propose the most suitable ML model for each field.
(2021-01-28T13:23:00Z)
- Artificial Intelligence (AI)-Centric Management of Resources in Modern
Distributed Computing Systems [22.550075095184514]
Cloud Data Centres (DCs) are large scale, complex, heterogeneous, and distributed across multiple networks and geographical boundaries.
The Internet of Things (IoT)-driven applications are producing a huge amount of data that requires real-time processing and fast response.
Existing Resource Management Systems (RMS) rely on either static or heuristic solutions that are inadequate for such composite and dynamic systems.
(2020-06-09T06:54:07Z)
- Offline Reinforcement Learning: Tutorial, Review, and Perspectives on
Open Problems [108.81683598693539]
Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines.
We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods.
(2020-05-04T17:00:15Z)