Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with
Rank Reordering
- URL: http://arxiv.org/abs/2105.14088v1
- Date: Fri, 28 May 2021 20:14:38 GMT
- Title: Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with
Rank Reordering
- Authors: Liang Luo, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze
- Abstract summary: Cloud Collectives is a prototype that accelerates collectives by reordering the ranks of participating VMs.
Cloud Collectives is non-intrusive, requires no code changes or rebuilds of an existing application, and runs without support from cloud providers.
Preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ML workloads are becoming increasingly popular in the cloud. Good cloud
training performance is contingent on efficient parameter exchange among VMs.
We find that collectives, the widely used distributed communication algorithms,
cannot perform optimally out of the box due to the hierarchical topology of
datacenter networks and the multi-tenant nature of the cloud environment. In this
paper, we present Cloud Collectives, a prototype that accelerates collectives
by reordering the ranks of participating VMs so that the communication pattern
dictated by the selected collective operation best exploits locality in
the network. Cloud Collectives is non-intrusive, requires no code changes or rebuilds
of an existing application, and runs without support from cloud providers. Our
preliminary application of Cloud Collectives to allreduce operations in public
clouds yields speedups of up to 3.7x in multiple microbenchmarks and 1.3x
in real-world workloads of distributed training of deep neural networks and
gradient-boosted decision trees using state-of-the-art frameworks.
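The core idea, reordering ranks so that neighbors in the collective's communication pattern are also near each other in the network, can be illustrated with a toy ring allreduce. The latency matrix, greedy heuristic, and cost function below are illustrative assumptions for this sketch, not the paper's actual measurement or optimization procedure:

```python
def ring_cost(order, latency):
    """Total latency over the neighbor links a ring allreduce would use
    for a given rank order."""
    n = len(order)
    return sum(latency[order[i]][order[(i + 1) % n]] for i in range(n))

def reorder_ranks(latency):
    """Greedy nearest-neighbor heuristic: start at VM 0 and repeatedly
    append the closest unplaced VM. A hypothetical stand-in for Cloud
    Collectives' optimizer, which the abstract does not specify."""
    n = len(latency)
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        nxt = min(remaining, key=lambda v: latency[order[-1]][v])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy topology: VMs 0 and 2 share a rack, as do 1 and 3; cross-rack links
# are 10x slower. The default rank order alternates racks on every hop.
latency = [
    [0, 10, 1, 10],
    [10, 0, 10, 1],
    [1, 10, 0, 10],
    [10, 1, 10, 0],
]
default = list(range(4))
tuned = reorder_ranks(latency)
print(default, ring_cost(default, latency))  # [0, 1, 2, 3] 40
print(tuned, ring_cost(tuned, latency))      # [0, 2, 1, 3] 22
```

Placing rack-mates adjacent in the ring keeps half the hops on fast intra-rack links, which is the kind of locality gain the reported speedups come from.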
Related papers
- Cloudy with a Chance of Anomalies: Dynamic Graph Neural Network for Early Detection of Cloud Services' User Anomalies [9.035212370386846]
This paper introduces a pioneering time-based embedding approach for Cloud Services Graph-based Anomaly Detection (CS-GAD).
Our method employs a dynamic tripartite graph representation to encapsulate the evolving interactions among cloud services, users, and their activities over time.
Results demonstrate a notable reduction in false positive rates (2-9%) compared to prevailing methods, coupled with a commendable true positive rate (100%).
arXiv Detail & Related papers (2024-09-19T12:50:31Z)
- Point Cloud Compression with Implicit Neural Representations: A Unified Framework [54.119415852585306]
We present a pioneering point cloud compression framework capable of handling both geometry and attribute components.
Our framework utilizes two coordinate-based neural networks to implicitly represent a voxelized point cloud.
Our method exhibits high universality when contrasted with existing learning-based techniques.
arXiv Detail & Related papers (2024-05-19T09:19:40Z)
- Efficient Cloud-edge Collaborative Inference for Object Re-identification [27.952445808987036]
We pioneer a cloud-edge collaborative inference framework for ReID systems.
We propose a distribution-aware correlation modeling network (DaCM) to ensure that the desired images are returned to the cloud server.
DaCM embeds the spatial-temporal correlations implicitly included in the timestamps into a graph structure, and it can be applied in the cloud to regulate the size of the upload window.
arXiv Detail & Related papers (2024-01-04T02:56:50Z)
- ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation [47.35179593006409]
We propose ECLM, an edge-cloud collaborative learning framework for rapid model adaptation for dynamic edge environments.
We show that ECLM significantly improves model performance (e.g., an 18.89% accuracy increase) and resource efficiency (e.g., a 7.12x communication-cost reduction) when adapting models to dynamic edge environments.
arXiv Detail & Related papers (2023-11-18T14:10:09Z)
- Deep Reinforcement Learning Based Resource Allocation for Cloud Native Wireless Network [20.377823731801456]
Cloud native technology has revolutionized 5G beyond and 6G communication networks, offering unprecedented levels of operational automation, flexibility, and adaptability.
The vast array of cloud native services and applications presents a new challenge in resource allocation for dynamic cloud computing environments.
We introduce deep reinforcement learning techniques and propose two model-free algorithms capable of monitoring the network state and dynamically training allocation policies.
Our findings demonstrate significant improvements in network efficiency, underscoring the potential of our proposed techniques in unlocking the full potential of cloud native wireless networks.
arXiv Detail & Related papers (2023-05-10T15:32:22Z)
- Managing Cold-start in The Serverless Cloud with Temporal Convolutional Networks [0.0]
Serverless cloud is an innovative cloud service model that frees customers from most cloud management duties.
A major threat to serverless cloud performance is cold-start: the delay of provisioning the cloud resources needed to serve customers' requests can incur unacceptable costs to the service providers and/or the customers.
This paper proposes a novel low-coupling, high-cohesion ensemble policy that addresses the cold-start problem at infrastructure- and function-levels of the serverless cloud stack.
arXiv Detail & Related papers (2023-04-01T21:54:22Z)
- An Efficient Split Fine-tuning Framework for Edge and Cloud Collaborative Learning [20.118073642453034]
We design an efficient split fine-tuning framework for edge and cloud collaborative learning.
We compress the intermediate output of a neural network to reduce the communication volume between the edge device and the cloud server.
Our framework can reduce the communication traffic by 96 times with little impact on the model accuracy.
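The split fine-tuning entry above hinges on compressing the activations sent from the edge device to the cloud. As a minimal illustration (not the paper's method, which the summary does not specify), uniform int8 quantization already cuts the traffic 4x; the reported 96x figure implies far more aggressive compression:

```python
import numpy as np

def compress_activations(x):
    """Uniformly quantize a float32 activation tensor to int8 plus a scale.
    A hypothetical stand-in for the paper's compressor."""
    scale = float(np.abs(x).max()) / 127 or 1.0  # guard against all-zero input
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def decompress_activations(q, scale):
    """Reconstruct approximate float32 activations on the cloud side."""
    return q.astype(np.float32) * scale

np.random.seed(0)
x = np.random.randn(4, 256).astype(np.float32)   # intermediate output at the split point
q, s = compress_activations(x)
x_hat = decompress_activations(q, s)
print(x.nbytes / q.nbytes)  # 4.0: float32 -> int8 over the wire
print(float(np.abs(x - x_hat).max()))  # worst-case reconstruction error, at most s/2
```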
arXiv Detail & Related papers (2022-11-30T02:55:21Z)
- Device-Cloud Collaborative Recommendation via Meta Controller [65.97416287295152]
We propose a meta controller to dynamically manage the collaboration between the on-device recommender and the cloud-based recommender.
On the basis of the counterfactual samples and the extended training, extensive experiments in the industrial recommendation scenarios show the promise of meta controller.
arXiv Detail & Related papers (2022-07-07T03:23:04Z)
- Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
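FedDST's central operation is extracting a sparse sub-network from the full model. A minimal sketch using magnitude-based pruning, one plausible selection criterion (the paper's exact scheme may differ):

```python
import numpy as np

def extract_sparse_subnetwork(weights, density=0.25):
    """Keep the top `density` fraction of weights by magnitude and zero the
    rest, returning the pruned weights and the binary mask. A hypothetical
    illustration of sparse sub-network extraction, not FedDST's exact rule."""
    flat = np.abs(weights).ravel()
    k = max(1, int(density * flat.size))
    threshold = np.partition(flat, -k)[-k]   # k-th largest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

np.random.seed(0)
w = np.random.randn(8, 8)                    # a toy weight matrix
sparse_w, mask = extract_sparse_subnetwork(w, density=0.25)
print(mask.mean())  # fraction of weights kept: 0.25
```

Only the surviving weights (and the mask) need to travel between edge devices and the cloud, which is where the framework's communication savings come from.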
arXiv Detail & Related papers (2021-12-18T02:26:38Z)
- Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
- Device-Cloud Collaborative Learning for Recommendation [50.01289274123047]
We propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models" given a centralized cloud model.
With billions of updated personalized device models, we propose a "model-over-models" distillation algorithm, namely MoMoDistill, to update the centralized cloud model.
arXiv Detail & Related papers (2021-04-14T05:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.