SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training
Architecture
- URL: http://arxiv.org/abs/2309.14148v1
- Date: Mon, 25 Sep 2023 14:01:35 GMT
- Title: SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training
Architecture
- Authors: Amine Barrak, Mayssa Jaziri, Ranim Trabelsi, Fehmi Jaafar, Fabio
Petrillo
- Abstract summary: This paper introduces SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture.
- Score: 0.61497722627646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of serverless computing has ushered in notable advancements in
distributed machine learning, particularly within parameter server-based
architectures. Yet, the integration of serverless features within peer-to-peer
(P2P) distributed networks remains largely uncharted. In this paper, we
introduce SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML
training architecture designed to bridge this gap.
Capitalizing on the inherent robustness and reliability innate to P2P
systems, SPIRT employs RedisAI for in-database operations, leading to an 82%
reduction in the time required for model updates and gradient averaging across
a variety of models and batch sizes. This architecture showcases resilience
against peer failures and adeptly manages the integration of new peers, thereby
highlighting its fault-tolerant characteristics and scalability. Furthermore,
SPIRT ensures secure communication between peers, enhancing the reliability of
distributed machine learning tasks. Even in the face of Byzantine attacks, the
system's robust aggregation algorithms maintain high levels of accuracy. These
findings illuminate the promising potential of serverless architectures in P2P
distributed machine learning, offering a significant stride towards the
development of more efficient, scalable, and resilient applications.
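The abstract credits RedisAI's in-database operations for the reported 82% reduction in model-update and gradient-averaging time, and robust aggregation for resilience to Byzantine peers. The sketch below is a minimal, hypothetical illustration of how those pieces could fit together, not the authors' implementation: it assumes a local RedisAI instance, a "grad:<peer>" key layout, and a coordinate-wise median as the robust rule, and it performs the arithmetic client-side, whereas an in-database variant would execute the same logic inside RedisAI itself.
```python
# Hypothetical sketch (not from the paper): peers publish gradients as RedisAI
# tensors; any peer can fetch them and apply a Byzantine-robust aggregation.
import numpy as np
import redisai as rai

con = rai.Client(host="localhost", port=6379)  # assumed local RedisAI instance

def robust_aggregate(peer_ids, key_pattern="grad:{}"):
    """Fetch each peer's gradient tensor and combine them coordinate-wise."""
    grads = [con.tensorget(key_pattern.format(p)) for p in peer_ids]
    stacked = np.stack(grads)  # shape: (num_peers, num_params)
    # The coordinate-wise median tolerates a minority of corrupted updates; it is
    # one common Byzantine-robust rule (the paper's exact rule may differ).
    return np.median(stacked, axis=0)

# Example: aggregate three peers' gradients and store the result back in the
# database so the next training step can read it from Redis.
aggregated = robust_aggregate(["peer-1", "peer-2", "peer-3"])
con.tensorset("grad:aggregated", aggregated.astype(np.float32))
```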
Related papers
- An Intelligent Native Network Slicing Security Architecture Empowered by Federated Learning [0.0]
We propose an intelligent security architecture to improve Network Slicing solutions.
We identify Distributed Denial-of-Service (DDoS) and intrusion attacks within the slice using generic and non-native telemetry records.
arXiv Detail & Related papers (2024-10-04T21:12:23Z)
- Robust and Actively Secure Serverless Collaborative Learning [48.01929996757643]
Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data.
While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both.
We propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients.
arXiv Detail & Related papers (2023-10-25T14:43:03Z)
- Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
- Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts [55.470959564665705]
Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks.
The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models.
It excels in NAS for building memory-efficient task-agnostic BERT models.
arXiv Detail & Related papers (2023-06-08T00:35:36Z)
- Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance [1.495380389108477]
Serverless computing is a new paradigm for cloud computing that uses functions as a computational unit.
By distributing the workload, distributed machine learning can speed up the training process and allow more complex models to be trained.
We propose exploring the use of serverless computing in distributed machine learning training and comparing the performance of the P2P architecture with the parameter server architecture.
arXiv Detail & Related papers (2023-02-27T17:38:47Z)
- VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch [10.061078548888567]
AI's widespread integration has led to the deployment of neural networks (NNs) on edge and similar resource-limited platforms for safety-critical scenarios.
This study introduces VeriCompress, a tool that automates the search and training of compressed models with robustness guarantees.
The method trains models 2-3 times faster than the state-of-the-art approaches, surpassing relevant baseline approaches by average accuracy and robustness gains of 15.1 and 9.8 percentage points, respectively.
arXiv Detail & Related papers (2022-11-17T23:42:10Z)
- FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating the two original techniques, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
arXiv Detail & Related papers (2022-04-25T10:00:00Z)
- RoFL: Attestable Robustness for Secure Federated Learning [59.63865074749391]
Federated Learning allows a large number of clients to train a joint model without the need to share their private data.
To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation.
We present RoFL, a secure Federated Learning system that improves robustness against malicious clients.
arXiv Detail & Related papers (2021-07-07T15:42:49Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to degradation or even collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service.
It preserves users' sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.