SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training
Architecture
- URL: http://arxiv.org/abs/2309.14148v1
- Date: Mon, 25 Sep 2023 14:01:35 GMT
- Title: SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training
Architecture
- Authors: Amine Barrak, Mayssa Jaziri, Ranim Trabelsi, Fehmi Jaafar, Fabio
Petrillo
- Abstract summary: This paper introduces SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture.
- Score: 0.61497722627646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of serverless computing has ushered in notable advancements in
distributed machine learning, particularly within parameter server-based
architectures. Yet, the integration of serverless features within peer-to-peer
(P2P) distributed networks remains largely uncharted. In this paper, we
introduce SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML
training architecture designed to bridge this gap.
Capitalizing on the inherent robustness and reliability innate to P2P
systems, SPIRT employs RedisAI for in-database operations, leading to an 82%
reduction in the time required for model updates and gradient averaging across
a variety of models and batch sizes. This architecture showcases resilience
against peer failures and adeptly manages the integration of new peers, thereby
highlighting its fault-tolerant characteristics and scalability. Furthermore,
SPIRT ensures secure communication between peers, enhancing the reliability of
distributed machine learning tasks. Even in the face of Byzantine attacks, the
system's robust aggregation algorithms maintain high levels of accuracy. These
findings illuminate the promising potential of serverless architectures in P2P
distributed machine learning, offering a significant stride towards the
development of more efficient, scalable, and resilient applications.
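The abstract credits RedisAI's in-database operations for the reported 82% reduction in model-update and gradient-averaging time, and robust aggregation for resilience to Byzantine peers. The sketch below is a minimal, hypothetical illustration of how those pieces could fit together, not the authors' implementation: it assumes a local RedisAI instance, a "grad:<peer>" key layout, and a coordinate-wise median as the robust rule, and it performs the arithmetic client-side, whereas an in-database variant would execute the same logic inside RedisAI itself.
```python
# Hypothetical sketch (not from the paper): peers publish gradients as RedisAI
# tensors; any peer can fetch them and apply a Byzantine-robust aggregation.
import numpy as np
import redisai as rai

con = rai.Client(host="localhost", port=6379)  # assumed local RedisAI instance

def robust_aggregate(peer_ids, key_pattern="grad:{}"):
    """Fetch each peer's gradient tensor and combine them coordinate-wise."""
    grads = [con.tensorget(key_pattern.format(p)) for p in peer_ids]
    stacked = np.stack(grads)  # shape: (num_peers, num_params)
    # The coordinate-wise median tolerates a minority of corrupted updates; it is
    # one common Byzantine-robust rule (the paper's exact rule may differ).
    return np.median(stacked, axis=0)

# Example: aggregate three peers' gradients and store the result back in the
# database so the next training step can read it from Redis.
aggregated = robust_aggregate(["peer-1", "peer-2", "peer-3"])
con.tensorset("grad:aggregated", aggregated.astype(np.float32))
```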
Related papers
- An Intelligent Native Network Slicing Security Architecture Empowered by Federated Learning [0.0]
We propose an intelligent security architecture to improve Network Slicing solutions.
We identify Distributed Denial-of-Service (DDoS) and intrusion attacks within the slice using generic and non-native telemetry records.
arXiv Detail & Related papers (2024-10-04T21:12:23Z)
- Robust and Actively Secure Serverless Collaborative Learning [48.01929996757643]
Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data.
While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both.
We propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients.
arXiv Detail & Related papers (2023-10-25T14:43:03Z)
- Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
- Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts [55.470959564665705]
Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks.
The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models.
It excels in NAS for building memory-efficient task-agnostic BERT models.
arXiv Detail & Related papers (2023-06-08T00:35:36Z)
- Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance [1.495380389108477]
Serverless computing is a new paradigm for cloud computing that uses functions as a computational unit.
By distributing the workload, distributed machine learning can speed up the training process and allow more complex models to be trained.
We propose exploring the use of serverless computing in distributed machine learning training and comparing the performance of the P2P architecture with the parameter server architecture.
arXiv Detail & Related papers (2023-02-27T17:38:47Z)
- VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch [10.061078548888567]
AI's widespread integration has led to the deployment of neural networks (NNs) on edge and similar resource-limited platforms for safety-critical scenarios.
This study introduces VeriCompress, a tool that automates the search and training of compressed models with robustness guarantees.
The method trains models 2-3 times faster than the state-of-the-art approaches, surpassing relevant baseline approaches by average accuracy and robustness gains of 15.1 and 9.8 percentage points, respectively.
arXiv Detail & Related papers (2022-11-17T23:42:10Z)
- FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating the two original techniques, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
arXiv Detail & Related papers (2022-04-25T10:00:00Z)
- RoFL: Attestable Robustness for Secure Federated Learning [59.63865074749391]
Federated Learning allows a large number of clients to train a joint model without the need to share their private data.
To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation.
We present RoFL, a secure Federated Learning system that improves robustness against malicious clients.
arXiv Detail & Related papers (2021-07-07T15:42:49Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to degradation or even collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service.
It preserves users' sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.