Architecting Peer-to-Peer Serverless Distributed Machine Learning
Training for Improved Fault Tolerance
- URL: http://arxiv.org/abs/2302.13995v1
- Date: Mon, 27 Feb 2023 17:38:47 GMT
- Title: Architecting Peer-to-Peer Serverless Distributed Machine Learning
Training for Improved Fault Tolerance
- Authors: Amine Barrak, Fabio Petrillo, Fehmi Jaafar
- Abstract summary: Serverless computing is a new paradigm for cloud computing that uses functions as a computational unit.
By distributing the workload, distributed machine learning can speed up the training process and allow more complex models to be trained.
We propose exploring the use of serverless computing in distributed machine learning training and comparing the performance of P2P architecture with the parameter server architecture.
- Score: 1.495380389108477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed Machine Learning refers to the practice of training a
model on multiple computers or devices, referred to as nodes. Serverless
computing is a new paradigm for cloud computing that uses functions as the
unit of computation. Serverless computing can be effective for distributed
learning systems by enabling automated resource scaling, reducing manual
intervention, and lowering cost. By distributing the workload, distributed
machine learning can speed up the training process and allow more complex
models to be trained. Several topologies for distributed machine learning
have been established (centralized, parameter server, peer-to-peer). However,
the parameter server architecture may have limitations in terms of fault
tolerance, including a single point of failure and complex recovery
processes. In contrast, training a model in a peer-to-peer (P2P) architecture
can offer benefits in terms of fault tolerance by eliminating the single
point of failure. In a P2P architecture, each node or worker can act as both
a server and a client, which allows for more decentralized decision making
and eliminates the need for a central coordinator. In this position paper, we
propose exploring the use of serverless computing for distributed machine
learning training and comparing the performance of the P2P architecture with
the parameter server architecture, focusing on cost reduction and fault
tolerance.
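To make the contrast concrete, below is a minimal, framework-agnostic sketch of the two aggregation patterns discussed in the abstract. It is an illustrative assumption, not code from the paper: the function names (local_gradient, parameter_server_step, p2p_step) are hypothetical, worker gradients are simulated with random noise, and each "worker" merely stands in for a serverless function invocation; storage, synchronization, and failure handling are omitted.

```python
import numpy as np

def local_gradient(weights: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for one worker's gradient on its local data shard."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=weights.shape)

def parameter_server_step(weights: np.ndarray, n_workers: int, lr: float = 0.1) -> np.ndarray:
    """Parameter-server pattern: every worker sends its gradient to one
    coordinator, which averages them and updates the single authoritative
    copy of the model. The coordinator is a single point of failure."""
    grads = [local_gradient(weights, seed=i) for i in range(n_workers)]
    return weights - lr * np.mean(grads, axis=0)

def p2p_step(replicas: list[np.ndarray], lr: float = 0.1) -> list[np.ndarray]:
    """Peer-to-peer pattern: every node keeps its own replica, computes a
    gradient, exchanges it with its peers, and applies the same averaged
    update locally, so no central coordinator is needed."""
    grads = [local_gradient(w, seed=i) for i, w in enumerate(replicas)]
    avg = np.mean(grads, axis=0)  # models the all-to-all exchange and averaging
    return [w - lr * avg for w in replicas]

if __name__ == "__main__":
    w = np.zeros(4)
    print("PS update :", parameter_server_step(w, n_workers=3))
    print("P2P update:", p2p_step([w.copy() for _ in range(3)])[0])
```

In the P2P variant, removing any single replica still leaves every remaining node with a complete copy of the model, which is the fault-tolerance property the abstract highlights.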
Related papers
- Communication Efficient ConFederated Learning: An Event-Triggered SAGA
Approach [67.27031215756121]
Federated learning (FL) is a machine learning paradigm that trains a model without gathering the local data distributed over various data sources.
Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability.
In this work, we consider a multi-server FL framework, referred to as Confederated Learning (CFL), in order to accommodate a larger number of users.
arXiv Detail & Related papers (2024-02-28T03:27:10Z)
- Scalable Federated Unlearning via Isolated and Coded Sharding [76.12847512410767]
Federated unlearning has emerged as a promising paradigm for erasing the effect of client-level data from a trained model.
This paper proposes a scalable federated unlearning framework based on isolated sharding and coded computing.
arXiv Detail & Related papers (2024-01-29T08:41:45Z)
- SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training
Architecture [0.61497722627646]
This paper introduces SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture.
arXiv Detail & Related papers (2023-09-25T14:01:35Z)
- Exploring the Impact of Serverless Computing on Peer To Peer Training
Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
- Scalable Collaborative Learning via Representation Sharing [53.047460465980144]
Federated learning (FL) and Split Learning (SL) are two frameworks that enable collaborative learning while keeping the data private (on device).
In FL, each data holder trains a model locally and releases it to a central server for aggregation.
In SL, the clients must release individual cut-layer activations (smashed data) to the server and wait for its response (during both inference and backpropagation); a minimal sketch contrasting the two patterns appears after this list.
In this work, we present a novel approach for privacy-preserving machine learning, where the clients collaborate via online knowledge distillation using a contrastive loss.
arXiv Detail & Related papers (2022-11-20T10:49:22Z)
- MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving
on Serverless Computing Platforms [5.089110111757978]
Serving machine learning inference workloads in the cloud remains a challenging task at the production level.
Serverless computing has emerged in recent years to automate most infrastructure management tasks.
We present MLProxy, a reverse proxy to support efficient machine learning serving workloads on serverless computing systems.
arXiv Detail & Related papers (2022-02-23T00:27:49Z)
- Federated Learning with Unreliable Clients: Performance Analysis and
Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Distributed Double Machine Learning with a Serverless Architecture [0.0]
This paper explores serverless cloud computing for double machine learning.
Double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing.
arXiv Detail & Related papers (2021-01-11T16:58:30Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
- Coded Federated Learning [5.375775284252717]
Federated learning is a method of training a global model from decentralized data distributed across client devices.
Our results show that CFL allows the global model to converge nearly four times faster when compared to an uncoded approach.
arXiv Detail & Related papers (2020-02-21T23:06:20Z)
- Dynamic Parameter Allocation in Parameter Servers [74.250687861348]
We propose to integrate dynamic parameter allocation into parameter servers and describe an efficient implementation of such a parameter server, called Lapse.
We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
arXiv Detail & Related papers (2020-02-03T11:37:54Z)
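As referenced in the "Scalable Collaborative Learning via Representation Sharing" entry above, the following is a minimal sketch contrasting the two collaboration patterns that entry describes: federated learning, where clients train locally and a server averages weights (FedAvg-style), and split learning, where a client computes activations up to a cut layer and the server finishes the forward pass. This is an illustrative assumption, not code from that paper; the function names (federated_round, split_forward) and tensor shapes are hypothetical, and local training is simulated with random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def federated_round(client_weights: list[np.ndarray], lr: float = 0.1) -> list[np.ndarray]:
    """FL: each client updates its full local model; the server only sees and
    averages the resulting weight vectors, then broadcasts them back."""
    updated = [w - lr * rng.normal(size=w.shape) for w in client_weights]  # stand-in for local SGD
    global_w = np.mean(updated, axis=0)              # server-side aggregation
    return [global_w.copy() for _ in client_weights]

def split_forward(client_part: np.ndarray, server_part: np.ndarray, x: np.ndarray) -> np.ndarray:
    """SL: the client computes only the layers up to the cut and sends the
    'smashed' activations; the server completes the forward pass."""
    smashed = np.tanh(x @ client_part)   # computed on the client, sent to the server
    return smashed @ server_part         # computed on the server

if __name__ == "__main__":
    clients = [np.ones(4) for _ in range(3)]
    print("FL weights after one round:", federated_round(clients)[0])
    print("SL output:", split_forward(np.full((2, 4), 0.1), np.ones((4, 1)), np.ones((1, 2))))
```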
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.