Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with
Rank Reordering
- URL: http://arxiv.org/abs/2105.14088v1
- Date: Fri, 28 May 2021 20:14:38 GMT
- Title: Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with
Rank Reordering
- Authors: Liang Luo, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze
- Abstract summary: Cloud Collectives is a prototype that accelerates collectives by reordering the ranks of participating VMs.
Cloud Collectives is non-intrusive, requires no code changes or rebuilds of an existing application, and runs without support from cloud providers.
Preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ML workloads are becoming increasingly popular in the cloud. Good cloud
training performance is contingent on efficient parameter exchange among VMs.
We find that collectives, the widely used distributed communication algorithms,
cannot perform optimally out of the box due to the hierarchical topology of
datacenter networks and the multi-tenant nature of the cloud environment. In this
paper, we present Cloud Collectives, a prototype that accelerates collectives
by reordering the ranks of participating VMs so that the communication pattern
dictated by the selected collective operation best exploits locality in
the network. Cloud Collectives is non-intrusive, requires no code changes or rebuilds
of an existing application, and runs without support from cloud providers. Our
preliminary application of Cloud Collectives to allreduce operations in public
clouds yields speedups of up to 3.7x in multiple microbenchmarks and 1.3x
in real-world workloads of distributed training of deep neural networks and
gradient-boosted decision trees using state-of-the-art frameworks.
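The core idea, reordering ranks so that neighbors in the collective's communication pattern are also near each other in the network, can be illustrated with a toy ring allreduce. The latency matrix, greedy heuristic, and cost function below are illustrative assumptions for this sketch, not the paper's actual measurement or optimization procedure:

```python
def ring_cost(order, latency):
    """Total latency over the neighbor links a ring allreduce would use
    for a given rank order."""
    n = len(order)
    return sum(latency[order[i]][order[(i + 1) % n]] for i in range(n))

def reorder_ranks(latency):
    """Greedy nearest-neighbor heuristic: start at VM 0 and repeatedly
    append the closest unplaced VM. A hypothetical stand-in for Cloud
    Collectives' optimizer, which the abstract does not specify."""
    n = len(latency)
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        nxt = min(remaining, key=lambda v: latency[order[-1]][v])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy topology: VMs 0 and 2 share a rack, as do 1 and 3; cross-rack links
# are 10x slower. The default rank order alternates racks on every hop.
latency = [
    [0, 10, 1, 10],
    [10, 0, 10, 1],
    [1, 10, 0, 10],
    [10, 1, 10, 0],
]
default = list(range(4))
tuned = reorder_ranks(latency)
print(default, ring_cost(default, latency))  # [0, 1, 2, 3] 40
print(tuned, ring_cost(tuned, latency))      # [0, 2, 1, 3] 22
```

Placing rack-mates adjacent in the ring keeps half the hops on fast intra-rack links, which is the kind of locality gain the reported speedups come from.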
Related papers
- Cloudy with a Chance of Anomalies: Dynamic Graph Neural Network for Early Detection of Cloud Services' User Anomalies [9.035212370386846]
This paper introduces a pioneering time-based embedding approach for Cloud Services Graph-based Anomaly Detection (CS-GAD).
Our method employs a dynamic tripartite graph representation to encapsulate the evolving interactions among cloud services, users, and their activities over time.
Results demonstrate a notable reduction in false positive rates (2-9%) compared to prevailing methods, coupled with a commendable true positive rate (100%).
arXiv Detail & Related papers (2024-09-19T12:50:31Z)
- Point Cloud Compression with Implicit Neural Representations: A Unified Framework [54.119415852585306]
We present a pioneering point cloud compression framework capable of handling both geometry and attribute components.
Our framework utilizes two coordinate-based neural networks to implicitly represent a voxelized point cloud.
Our method exhibits high universality when contrasted with existing learning-based techniques.
arXiv Detail & Related papers (2024-05-19T09:19:40Z)
- Efficient Cloud-edge Collaborative Inference for Object Re-identification [27.952445808987036]
We pioneer a cloud-edge collaborative inference framework for ReID systems.
We propose a distribution-aware correlation modeling network (DaCM) to ensure that the desired images are returned to the cloud server.
DaCM embeds the spatial-temporal correlations implicitly included in the timestamps into a graph structure, and it can be applied in the cloud to regulate the size of the upload window.
arXiv Detail & Related papers (2024-01-04T02:56:50Z)
- ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation [47.35179593006409]
We propose ECLM, an edge-cloud collaborative learning framework for rapid model adaptation for dynamic edge environments.
We show that ECLM significantly improves model performance (e.g., an 18.89% accuracy increase) and resource efficiency (e.g., a 7.12x communication-cost reduction) when adapting models to dynamic edge environments.
arXiv Detail & Related papers (2023-11-18T14:10:09Z)
- Deep Reinforcement Learning Based Resource Allocation for Cloud Native Wireless Network [20.377823731801456]
Cloud native technology has revolutionized 5G beyond and 6G communication networks, offering unprecedented levels of operational automation, flexibility, and adaptability.
The vast array of cloud native services and applications presents a new challenge in resource allocation for dynamic cloud computing environments.
We introduce deep reinforcement learning techniques and propose two model-free algorithms capable of monitoring the network state and dynamically training allocation policies.
Our findings demonstrate significant improvements in network efficiency, underscoring the potential of our proposed techniques in unlocking the full potential of cloud native wireless networks.
arXiv Detail & Related papers (2023-05-10T15:32:22Z)
- Managing Cold-start in The Serverless Cloud with Temporal Convolutional Networks [0.0]
Serverless cloud is an innovative cloud service model that frees customers from most cloud management duties.
A major threat to serverless cloud performance is cold-start: the delay of provisioning the cloud resources needed to serve customers' requests can incur unacceptable costs to the service providers and/or the customers.
This paper proposes a novel low-coupling, high-cohesion ensemble policy that addresses the cold-start problem at infrastructure- and function-levels of the serverless cloud stack.
arXiv Detail & Related papers (2023-04-01T21:54:22Z)
- An Efficient Split Fine-tuning Framework for Edge and Cloud Collaborative Learning [20.118073642453034]
We design an efficient split fine-tuning framework for edge and cloud collaborative learning.
We compress the intermediate output of a neural network to reduce the communication volume between the edge device and the cloud server.
Our framework can reduce the communication traffic by 96 times with little impact on the model accuracy.
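The split fine-tuning entry above hinges on compressing the activations sent from the edge device to the cloud. As a minimal illustration (not the paper's method, which the summary does not specify), uniform int8 quantization already cuts the traffic 4x; the reported 96x figure implies far more aggressive compression:

```python
import numpy as np

def compress_activations(x):
    """Uniformly quantize a float32 activation tensor to int8 plus a scale.
    A hypothetical stand-in for the paper's compressor."""
    scale = float(np.abs(x).max()) / 127 or 1.0  # guard against all-zero input
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def decompress_activations(q, scale):
    """Reconstruct approximate float32 activations on the cloud side."""
    return q.astype(np.float32) * scale

np.random.seed(0)
x = np.random.randn(4, 256).astype(np.float32)   # intermediate output at the split point
q, s = compress_activations(x)
x_hat = decompress_activations(q, s)
print(x.nbytes / q.nbytes)  # 4.0: float32 -> int8 over the wire
print(float(np.abs(x - x_hat).max()))  # worst-case reconstruction error, at most s/2
```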
arXiv Detail & Related papers (2022-11-30T02:55:21Z)
- Device-Cloud Collaborative Recommendation via Meta Controller [65.97416287295152]
We propose a meta controller to dynamically manage the collaboration between the on-device recommender and the cloud-based recommender.
On the basis of the counterfactual samples and the extended training, extensive experiments in the industrial recommendation scenarios show the promise of meta controller.
arXiv Detail & Related papers (2022-07-07T03:23:04Z)
- Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
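FedDST's central operation is extracting a sparse sub-network from the full model. A minimal sketch using magnitude-based pruning, one plausible selection criterion (the paper's exact scheme may differ):

```python
import numpy as np

def extract_sparse_subnetwork(weights, density=0.25):
    """Keep the top `density` fraction of weights by magnitude and zero the
    rest, returning the pruned weights and the binary mask. A hypothetical
    illustration of sparse sub-network extraction, not FedDST's exact rule."""
    flat = np.abs(weights).ravel()
    k = max(1, int(density * flat.size))
    threshold = np.partition(flat, -k)[-k]   # k-th largest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

np.random.seed(0)
w = np.random.randn(8, 8)                    # a toy weight matrix
sparse_w, mask = extract_sparse_subnetwork(w, density=0.25)
print(mask.mean())  # fraction of weights kept: 0.25
```

Only the surviving weights (and the mask) need to travel between edge devices and the cloud, which is where the framework's communication savings come from.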
arXiv Detail & Related papers (2021-12-18T02:26:38Z)
- Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
- Device-Cloud Collaborative Learning for Recommendation [50.01289274123047]
We propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models" given a centralized cloud model.
With billions of updated personalized device models, we propose a "model-over-models" distillation algorithm, namely MoMoDistill, to update the centralized cloud model.
arXiv Detail & Related papers (2021-04-14T05:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.