Architecting Peer-to-Peer Serverless Distributed Machine Learning
Training for Improved Fault Tolerance
- URL: http://arxiv.org/abs/2302.13995v1
- Date: Mon, 27 Feb 2023 17:38:47 GMT
- Title: Architecting Peer-to-Peer Serverless Distributed Machine Learning
Training for Improved Fault Tolerance
- Authors: Amine Barrak, Fabio Petrillo, Fehmi Jaafar
- Abstract summary: Serverless computing is a new paradigm for cloud computing that uses functions as a computational unit.
By distributing the workload, distributed machine learning can speed up the training process and allow more complex models to be trained.
We propose exploring the use of serverless computing in distributed machine learning training and comparing the performance of P2P architecture with the parameter server architecture.
- Score: 1.495380389108477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed Machine Learning refers to the practice of training a
model on multiple computers or devices, referred to as nodes. Serverless
computing is a new paradigm for cloud computing that uses functions as the
unit of computation. Serverless computing can be effective for distributed
learning systems by enabling automated resource scaling, reducing manual
intervention, and lowering cost. By distributing the workload, distributed
machine learning can speed up the training process and allow more complex
models to be trained. Several topologies for distributed machine learning
have been established (centralized, parameter server, peer-to-peer). However,
the parameter server architecture may have limitations in terms of fault
tolerance, including a single point of failure and complex recovery
processes. In contrast, training a model in a peer-to-peer (P2P) architecture
can offer benefits in terms of fault tolerance by eliminating the single
point of failure. In a P2P architecture, each node or worker can act as both
a server and a client, which allows for more decentralized decision making
and eliminates the need for a central coordinator. In this position paper, we
propose exploring the use of serverless computing for distributed machine
learning training and comparing the performance of the P2P architecture with
the parameter server architecture, focusing on cost reduction and fault
tolerance.
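To make the contrast concrete, below is a minimal, framework-agnostic sketch of the two aggregation patterns discussed in the abstract. It is an illustrative assumption, not code from the paper: the function names (local_gradient, parameter_server_step, p2p_step) are hypothetical, worker gradients are simulated with random noise, and each "worker" merely stands in for a serverless function invocation; storage, synchronization, and failure handling are omitted.

```python
import numpy as np

def local_gradient(weights: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for one worker's gradient on its local data shard."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=weights.shape)

def parameter_server_step(weights: np.ndarray, n_workers: int, lr: float = 0.1) -> np.ndarray:
    """Parameter-server pattern: every worker sends its gradient to one
    coordinator, which averages them and updates the single authoritative
    copy of the model. The coordinator is a single point of failure."""
    grads = [local_gradient(weights, seed=i) for i in range(n_workers)]
    return weights - lr * np.mean(grads, axis=0)

def p2p_step(replicas: list[np.ndarray], lr: float = 0.1) -> list[np.ndarray]:
    """Peer-to-peer pattern: every node keeps its own replica, computes a
    gradient, exchanges it with its peers, and applies the same averaged
    update locally, so no central coordinator is needed."""
    grads = [local_gradient(w, seed=i) for i, w in enumerate(replicas)]
    avg = np.mean(grads, axis=0)  # models the all-to-all exchange and averaging
    return [w - lr * avg for w in replicas]

if __name__ == "__main__":
    w = np.zeros(4)
    print("PS update :", parameter_server_step(w, n_workers=3))
    print("P2P update:", p2p_step([w.copy() for _ in range(3)])[0])
```

In the P2P variant, removing any single replica still leaves every remaining node with a complete copy of the model, which is the fault-tolerance property the abstract highlights.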
Related papers
- Communication Efficient ConFederated Learning: An Event-Triggered SAGA
Approach [67.27031215756121]
Federated learning (FL) is a machine learning paradigm that trains a model without gathering the local data distributed over various data sources.
Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability.
In this work, we consider a multi-server FL framework, referred to as Confederated Learning (CFL), in order to accommodate a larger number of users.
arXiv Detail & Related papers (2024-02-28T03:27:10Z)
- Scalable Federated Unlearning via Isolated and Coded Sharding [76.12847512410767]
Federated unlearning has emerged as a promising paradigm for erasing the effect of client-level data from a trained model.
This paper proposes a scalable federated unlearning framework based on isolated sharding and coded computing.
arXiv Detail & Related papers (2024-01-29T08:41:45Z)
- SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training
Architecture [0.61497722627646]
This paper introduces SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture.
arXiv Detail & Related papers (2023-09-25T14:01:35Z)
- Exploring the Impact of Serverless Computing on Peer To Peer Training
Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
- Scalable Collaborative Learning via Representation Sharing [53.047460465980144]
Federated learning (FL) and Split Learning (SL) are two frameworks that enable collaborative learning while keeping the data private (on device).
In FL, each data holder trains a model locally and releases it to a central server for aggregation.
In SL, the clients must release individual cut-layer activations (smashed data) to the server and wait for its response (during both inference and backpropagation); a minimal sketch contrasting the two patterns appears after this list.
In this work, we present a novel approach for privacy-preserving machine learning, where the clients collaborate via online knowledge distillation using a contrastive loss.
arXiv Detail & Related papers (2022-11-20T10:49:22Z)
- MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving
on Serverless Computing Platforms [5.089110111757978]
Serving machine learning inference workloads in the cloud remains a challenging task at the production level.
Serverless computing has emerged in recent years to automate most infrastructure management tasks.
We present MLProxy, a reverse proxy to support efficient machine learning serving workloads on serverless computing systems.
arXiv Detail & Related papers (2022-02-23T00:27:49Z)
- Federated Learning with Unreliable Clients: Performance Analysis and
Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Distributed Double Machine Learning with a Serverless Architecture [0.0]
This paper explores serverless cloud computing for double machine learning.
Double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing.
arXiv Detail & Related papers (2021-01-11T16:58:30Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
- Coded Federated Learning [5.375775284252717]
Federated learning is a method of training a global model from decentralized data distributed across client devices.
Our results show that CFL allows the global model to converge nearly four times faster when compared to an uncoded approach.
arXiv Detail & Related papers (2020-02-21T23:06:20Z)
- Dynamic Parameter Allocation in Parameter Servers [74.250687861348]
We propose to integrate dynamic parameter allocation into parameter servers and describe an efficient implementation of such a parameter server, called Lapse.
We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
arXiv Detail & Related papers (2020-02-03T11:37:54Z)
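As referenced in the "Scalable Collaborative Learning via Representation Sharing" entry above, the following is a minimal sketch contrasting the two collaboration patterns that entry describes: federated learning, where clients train locally and a server averages weights (FedAvg-style), and split learning, where a client computes activations up to a cut layer and the server finishes the forward pass. This is an illustrative assumption, not code from that paper; the function names (federated_round, split_forward) and tensor shapes are hypothetical, and local training is simulated with random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def federated_round(client_weights: list[np.ndarray], lr: float = 0.1) -> list[np.ndarray]:
    """FL: each client updates its full local model; the server only sees and
    averages the resulting weight vectors, then broadcasts them back."""
    updated = [w - lr * rng.normal(size=w.shape) for w in client_weights]  # stand-in for local SGD
    global_w = np.mean(updated, axis=0)              # server-side aggregation
    return [global_w.copy() for _ in client_weights]

def split_forward(client_part: np.ndarray, server_part: np.ndarray, x: np.ndarray) -> np.ndarray:
    """SL: the client computes only the layers up to the cut and sends the
    'smashed' activations; the server completes the forward pass."""
    smashed = np.tanh(x @ client_part)   # computed on the client, sent to the server
    return smashed @ server_part         # computed on the server

if __name__ == "__main__":
    clients = [np.ones(4) for _ in range(3)]
    print("FL weights after one round:", federated_round(clients)[0])
    print("SL output:", split_forward(np.full((2, 4), 0.1), np.ones((4, 1)), np.ones((1, 2))))
```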
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.