Distributed Learning and Inference Systems: A Networking Perspective
- URL: http://arxiv.org/abs/2501.05323v1
- Date: Thu, 09 Jan 2025 15:48:29 GMT
- Title: Distributed Learning and Inference Systems: A Networking Perspective
- Authors: Hesham G. Moussa, Arashmid Akhavain, S. Maryam Hosseini, Bill McCormick,
- Abstract summary: This work proposes a novel framework, Data and Dynamics-Aware Inference and Training Networks (DA-ITN).
The different components of DA-ITN and their functions are explored, and the associated challenges and research areas are highlighted.
- Abstract: Machine learning models have achieved, and in some cases surpassed, human-level performance in various tasks, mainly through centralized training of static models and the use of large models stored in centralized clouds for inference. However, this centralized approach has several drawbacks, including privacy concerns, high storage demands, a single point of failure, and significant computing requirements. These challenges have driven interest in developing alternative decentralized and distributed methods for AI training and inference. Distribution introduces additional complexity, as it requires managing multiple moving parts. To address these complexities and fill a gap in the development of distributed AI systems, this work proposes a novel framework, Data and Dynamics-Aware Inference and Training Networks (DA-ITN). The different components of DA-ITN and their functions are explored, and the associated challenges and research areas are highlighted.
Related papers
- A study on performance limitations in Federated Learning [0.05439020425819]
This project focuses on the communication bottleneck and data non-IIDness, and their effect on model performance.
Google introduced Federated Learning in 2016.
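As a rough illustration of the setting such studies examine (a sketch of our own, not code from the paper), the snippet below partitions a labelled dataset non-IID across clients and performs one FedAvg-style aggregation, the step whose repeated parameter exchange creates the communication bottleneck.

    # Illustrative sketch (assumptions ours), not code from the cited paper.
    import numpy as np

    # Hypothetical non-IID split: each client only sees a couple of label classes.
    def non_iid_partition(labels, num_clients, classes_per_client=2, seed=0):
        rng = np.random.default_rng(seed)
        classes = np.unique(labels)
        parts = {}
        for c in range(num_clients):
            own = rng.choice(classes, size=classes_per_client, replace=False)
            parts[c] = np.where(np.isin(labels, own))[0]
        return parts

    # FedAvg-style aggregation: size-weighted mean of client models (one communication round).
    def fed_avg(client_weights, client_sizes):
        total = sum(client_sizes)
        return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

    labels = np.random.default_rng(1).integers(0, 10, size=1000)
    parts = non_iid_partition(labels, num_clients=5)
    models = [np.random.default_rng(i).normal(size=4) for i in range(5)]
    global_model = fed_avg(models, [len(p) for p in parts.values()])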
arXiv Detail & Related papers (2025-01-07T02:35:41Z) - Model Agnostic Hybrid Sharding For Heterogeneous Distributed Inference [11.39873199479642]
Nesa introduces a model-agnostic sharding framework designed for decentralized AI inference.
Our framework uses blockchain-based deep neural network sharding to distribute computational tasks across a diverse network of nodes.
Our results highlight the potential to democratize access to cutting-edge AI technologies.
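The abstract gives no implementation details, so the following is only a loose sketch of layer-wise sharding, with the node and hand-off structure assumed for illustration rather than taken from Nesa's framework.

    # Illustrative sketch (assumptions ours), not Nesa's actual framework.
    import numpy as np

    class Shard:
        """One node holds a contiguous slice of the network's layers."""
        def __init__(self, layers):
            self.layers = layers  # list of (W, b) pairs stored on this node

        def forward(self, x):
            for W, b in self.layers:
                x = np.maximum(x @ W + b, 0.0)  # simple ReLU MLP layers
            return x

    # Split a 4-layer MLP across two nodes and run inference hop by hop.
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(8, 8)), np.zeros(8)) for _ in range(4)]
    nodes = [Shard(layers[:2]), Shard(layers[2:])]

    x = rng.normal(size=(1, 8))
    for node in nodes:          # each hop would be a network transfer in practice
        x = node.forward(x)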
arXiv Detail & Related papers (2024-07-29T08:18:48Z) - A Survey of Distributed Learning in Cloud, Mobile, and Edge Settings [1.0589208420411014]
This survey explores the landscape of distributed learning, encompassing cloud and edge settings.
We delve into the core concepts of data and model parallelism, examining how models are partitioned across different dimensions and layers to optimize resource utilization and performance.
We analyze various partitioning schemes for different layer types, including fully connected, convolutional, and recurrent layers, highlighting the trade-offs between computational efficiency, communication overhead, and memory constraints.
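As a concrete example of such partitioning, a fully connected layer can be split along its output dimension so that each worker stores only a column slice of the weight matrix; the minimal sketch below illustrates the idea and is not code from the survey.

    # Illustrative sketch (assumptions ours), not code from the survey.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 256))          # a batch of input activations
    W = rng.normal(size=(256, 512))         # full weight matrix of one dense layer

    # Column-wise (output-dimension) partition across two workers.
    W0, W1 = np.split(W, 2, axis=1)         # each worker stores half the parameters
    y0 = x @ W0                             # computed on worker 0
    y1 = x @ W1                             # computed on worker 1
    y = np.concatenate([y0, y1], axis=1)    # the concatenation is the communication step

    assert np.allclose(y, x @ W)            # same result as the unpartitioned layer

A row-wise split would instead require summing partial products across workers, trading the concatenation for an all-reduce, which is exactly the kind of communication/memory trade-off such surveys analyze.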
arXiv Detail & Related papers (2024-05-23T22:00:38Z) - Decentralized Learning Made Easy with DecentralizePy [3.1848820580333737]
Decentralized learning (DL) has gained prominence for its potential benefits in terms of scalability, privacy, and fault tolerance.
We propose DecentralizePy, a distributed framework for decentralized ML, which allows for the emulation of large-scale learning networks in arbitrary topologies.
We demonstrate the capabilities of DecentralizePy by deploying techniques such as sparsification and secure aggregation on top of several topologies.
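As an example of the kind of technique mentioned above, the snippet below sketches plain top-k sparsification of a model update before it is exchanged with neighbours; the function is illustrative and is not part of DecentralizePy's actual API.

    # Illustrative sketch (assumptions ours), not DecentralizePy's API.
    import numpy as np

    def top_k_sparsify(update, k):
        """Keep only the k largest-magnitude entries of a model update."""
        flat = update.ravel()
        keep = np.argpartition(np.abs(flat), -k)[-k:]
        sparse = np.zeros_like(flat)
        sparse[keep] = flat[keep]
        return sparse.reshape(update.shape)

    update = np.random.default_rng(0).normal(size=(1000,))
    sent = top_k_sparsify(update, k=50)     # only ~5% of the entries leave the node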
arXiv Detail & Related papers (2023-04-17T14:42:33Z) - Multi-Resource Allocation for On-Device Distributed Federated Learning Systems [79.02994855744848]
This work proposes a distributed multi-resource allocation scheme for minimizing the weighted sum of latency and energy consumption in the on-device distributed federated learning (FL) system.
Each mobile device in the system takes part in the model training process within the specified area, allocating its computation resources to deriving parameters and its communication resources to uploading them.
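In such formulations the per-round objective is typically a weighted combination of completion latency and the energy spent on computing and transmitting updates; a generic form (our notation, not necessarily the paper's exact model) is:

    \min_{\{f_k,\, p_k\}} \;\; \lambda \, T(\{f_k, p_k\}) \;+\; (1 - \lambda)\, E(\{f_k, p_k\}),

where f_k and p_k are device k's CPU frequency and transmit power, T is the per-round training latency, E is the total energy for computation and communication, and \lambda \in [0, 1] trades the two off.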
arXiv Detail & Related papers (2022-11-01T14:16:05Z) - Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for mean-field control (MFC).
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z) - Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data [77.88594632644347]
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks.
In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization challenge.
We propose a novel momentum-based method to mitigate this decentralized training difficulty.
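Very loosely, the idea of pairing local momentum with gossip averaging can be sketched as below; this toy loop is our own simplification and does not reproduce the paper's exact quasi-global momentum update.

    # Illustrative sketch (assumptions ours), not the paper's algorithm.
    import numpy as np

    rng = np.random.default_rng(0)
    n_nodes, dim, lr, beta = 4, 10, 0.1, 0.9
    params = [rng.normal(size=dim) for _ in range(n_nodes)]
    momentum = [np.zeros(dim) for _ in range(n_nodes)]
    mixing = np.full((n_nodes, n_nodes), 1.0 / n_nodes)   # doubly stochastic mixing matrix

    for step in range(20):
        # toy heterogeneous gradients of 0.5 * ||p||^2, perturbed per node
        grads = [p + rng.normal(scale=0.1, size=dim) for p in params]
        prev = params
        params = [p - lr * (beta * m + g) for p, m, g in zip(params, momentum, grads)]
        # gossip step: every node mixes its parameters with its neighbours'
        params = [sum(mixing[i, j] * params[j] for j in range(n_nodes)) for i in range(n_nodes)]
        # momentum tracks the movement of the mixed model rather than the purely local gradient
        momentum = [beta * m + (pv - pn) / lr for m, pv, pn in zip(momentum, prev, params)]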
arXiv Detail & Related papers (2021-02-09T11:27:14Z) - Decentralized Federated Learning Preserves Model and Data Privacy [77.454688257702]
We propose a fully decentralized approach, which allows knowledge to be shared between trained models.
Student models are trained on the outputs of their teachers using synthetically generated input data.
The results show that a previously untrained student model, trained on the teacher's outputs, reaches F1-scores comparable to those of the teacher.
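The mechanism can be pictured roughly as follows: synthetic inputs are pushed through a frozen teacher, and the student regresses onto the teacher's outputs; this toy linear version is our own illustration of the idea, not the authors' method or code.

    # Illustrative sketch (assumptions ours), not the authors' method.
    import numpy as np

    rng = np.random.default_rng(0)

    # Frozen "teacher": a fixed random linear map standing in for an already trained model.
    W_teacher = rng.normal(size=(16, 4))

    # The student starts untrained and only ever sees synthetic inputs plus teacher outputs.
    W_student = np.zeros((16, 4))
    lr = 0.05
    for step in range(500):
        x = rng.normal(size=(32, 16))              # synthetically generated input data
        target = x @ W_teacher                     # teacher predictions used as labels
        pred = x @ W_student
        grad = x.T @ (pred - target) / len(x)      # gradient of the mean-squared error
        W_student -= lr * grad

    print(np.abs(W_student - W_teacher).max())     # the student approaches the teacher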
arXiv Detail & Related papers (2021-02-01T14:38:54Z) - Distributed Training of Deep Learning Models: A Taxonomic Perspective [11.924058430461216]
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster.
We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines.
arXiv Detail & Related papers (2020-07-08T08:56:58Z) - Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems [71.14339738190202]
Democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems.
Inspired by the Dem-AI philosophy, a novel distributed learning approach is proposed in this paper.
The proposed algorithms achieve better generalization performance for the agents' learning models than conventional FL algorithms.
arXiv Detail & Related papers (2020-07-07T08:34:48Z) - Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address the open problems that arise, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.