TDML -- A Trustworthy Distributed Machine Learning Framework
- URL: http://arxiv.org/abs/2407.07339v1
- Date: Wed, 10 Jul 2024 03:22:28 GMT
- Title: TDML -- A Trustworthy Distributed Machine Learning Framework
- Authors: Zhen Wang, Qin Wang, Guangsheng Yu, Shiping Chen
- Abstract summary: The rapid advancement of large models (LM) has intensified the demand for computing resources.
This demand is exacerbated by limited availability due to supply chain delays and monopolistic acquisition by major tech firms.
We propose a trustworthy distributed machine learning (TDML) framework that leverages blockchain to coordinate remote trainers and validate workloads.
- Score: 7.302091381583343
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent years have witnessed a surge in deep learning research, marked by the introduction of expansive generative models like OpenAI's SORA and GPT, Meta AI's LLAMA series, and Google's FLAN, BART, and Gemini models. However, the rapid advancement of large models (LM) has intensified the demand for computing resources, particularly GPUs, which are crucial for their parallel processing capabilities. This demand is exacerbated by limited GPU availability due to supply chain delays and monopolistic acquisition by major tech firms. Distributed Machine Learning (DML) methods, such as Federated Learning (FL), mitigate these challenges by partitioning data and models across multiple servers, though implementing optimizations like tensor and pipeline parallelism remains complex. Blockchain technology emerges as a promising solution, ensuring data integrity, scalability, and trust in distributed computing environments, but still lacks guidance on building practical DML systems. In this paper, we propose a \textit{trustworthy distributed machine learning} (TDML) framework that leverages blockchain to coordinate remote trainers and validate workloads, achieving privacy, transparency, and efficient model training across public remote computing resources. Experimental validation demonstrates TDML's efficacy in overcoming performance limitations and detecting malicious nodes, positioning it as a robust solution for scalable and secure distributed machine learning.
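As a rough illustration of the coordination-and-validation loop the abstract describes, the sketch below shows remote trainers publishing hashed workload records to an append-only ledger, with validators recomputing each claimed update to flag malicious nodes. All names here (Ledger, train_step, validate_workload) and the toy "training" arithmetic are our own stand-ins for illustration, not the paper's actual protocol or API:

```python
import hashlib
import json

class Ledger:
    """An append-only log standing in for a blockchain."""
    def __init__(self):
        self.blocks = []

    def append(self, record: dict) -> str:
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.blocks.append({"hash": digest, "record": record})
        return digest

def train_step(data, lie=False):
    """Toy 'workload': report the mean of the local data.
    A malicious trainer (lie=True) reports a fabricated update."""
    update = sum(data) / len(data)
    return update + 100.0 if lie else update

def validate_workload(ledger, data_by_trainer):
    """Validators re-run each claimed workload and flag mismatches."""
    flagged = []
    for block in ledger.blocks:
        rec = block["record"]
        expected = train_step(data_by_trainer[rec["trainer"]])
        if abs(expected - rec["update"]) > 1e-9:
            flagged.append(rec["trainer"])
    return flagged

ledger = Ledger()
data = {"t1": [1.0, 2.0, 3.0], "t2": [2.0, 4.0, 6.0]}
ledger.append({"trainer": "t1", "update": train_step(data["t1"])})
ledger.append({"trainer": "t2", "update": train_step(data["t2"], lie=True)})
print("flagged as malicious:", validate_workload(ledger, data))  # ['t2']
```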
Related papers
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- A Blockchain-empowered Multi-Aggregator Federated Learning Architecture in Edge Computing with Deep Reinforcement Learning Optimization [8.082460100928358]
Federated learning (FL) is emerging as a sought-after distributed machine learning architecture.
With advancements in network infrastructure, FL has been seamlessly integrated into edge computing.
While blockchain technology promises to bolster security, practical deployment on resource-constrained edge devices remains a challenge.
arXiv Detail & Related papers (2023-10-14T20:47:30Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare model FLOP utilization against that of a state-of-the-art data center GPU, and study network utilization in realistic conditions (a toy model-FLOP-utilization calculation follows this entry).
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
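The "model FLOP utilization" (MFU) compared in the entry above has a standard definition: achieved training FLOP/s divided by the hardware's peak FLOP/s. A minimal sketch with made-up placeholder numbers; the 6·N FLOPs-per-token estimate and the 312 TFLOP/s peak are common rules of thumb, not the paper's measurements:

```python
def mfu(tokens_per_s: float, flops_per_token: float, peak_flops: float) -> float:
    """Model FLOP utilization: achieved FLOP/s over peak FLOP/s."""
    return tokens_per_s * flops_per_token / peak_flops

# ~6*N training FLOPs per token for an N-parameter transformer (common estimate)
n_params = 1e9
print(f"{mfu(tokens_per_s=2_000, flops_per_token=6 * n_params, peak_flops=312e12):.2%}")
# -> 3.85%, the kind of low utilization such a hardware benchmark would surface
```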
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability introduced by peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Experimenting with Emerging RISC-V Systems for Decentralised Machine Learning [12.18598759507803]
Decentralised Machine Learning (DML) enables collaborative machine learning without centralised input data.
We map DML schemes to an underlying parallel programming library (a toy example of such a scheme is sketched after this entry).
We experiment with it by generating different working DML schemes on x86-64 and ARM platforms and an emerging RISC-V one.
As a byproduct, we introduce a RISC-V porting of the PyTorch framework, the first publicly available to our knowledge.
arXiv Detail & Related papers (2023-02-15T20:57:42Z)
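As a toy example of the kind of DML scheme such a mapping targets (our illustration, not the paper's actual schemes or library), workers take local gradient steps and then synchronize by averaging parameters, i.e. an all-reduce:

```python
def local_step(w: float, grad: float, lr: float = 0.1) -> float:
    """One local SGD step on a worker's scalar 'model'."""
    return w - lr * grad

def all_reduce_mean(ws: list) -> list:
    """Synchronize workers by replacing each copy with the global mean."""
    m = sum(ws) / len(ws)
    return [m] * len(ws)

weights = [0.0, 1.0, 2.0]   # one scalar model replica per worker
grads = [0.5, -0.5, 1.0]    # each worker's local gradient
weights = [local_step(w, g) for w, g in zip(weights, grads)]
weights = all_reduce_mean(weights)
print(weights)              # all workers now hold the same value (~0.9667)
```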
- Latency Optimization for Blockchain-Empowered Federated Learning in Multi-Server Edge Computing [24.505675843652448]
In this paper, we study a new latency optimization problem for blockchain-based federated learning (BFL) in multi-server edge computing (a schematic latency decomposition follows this entry).
In this system model, distributed mobile devices (MDs) communicate with a set of edge servers (ESs) to handle both machine learning (ML) model training and block mining simultaneously.
arXiv Detail & Related papers (2022-03-18T00:38:29Z)
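To make "latency optimization" concrete, a schematic per-round decomposition for BFL-style systems might look as follows; the notation is ours (\(\mathcal{M}\): mobile devices, \(T^{\text{cmp}}_i\): local training time, \(S_{\text{model}}/R_i\): uplink transfer time at rate \(R_i\), \(T^{\text{mine}}\): mining/consensus time), not necessarily the paper's model:

```latex
\[
  T_{\text{round}}
    = \max_{i \in \mathcal{M}} \Bigl( T^{\text{cmp}}_{i} + \frac{S_{\text{model}}}{R_{i}} \Bigr)
    + T^{\text{mine}}
\]
```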
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed to actually deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Clairvoyant Prefetching for Distributed Machine Learning I/O [9.490118207943192]
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers.
We introduce a novel machine learning I/O framework, HDMLP, to tackle this bottleneck. HDMLP provides an easy-to-use, flexible, scalable solution that delivers better performance than state-of-the-art approaches (a minimal prefetching sketch follows this entry).
arXiv Detail & Related papers (2021-01-21T17:21:42Z)
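The "clairvoyant" idea is that a training job's sample access order is fixed in advance by a seeded shuffle, so an I/O layer can stage upcoming samples before the trainer asks for them. A minimal single-node sketch of that principle (HDMLP's actual design, with multi-level buffering and a distributed cache, is far richer):

```python
import queue
import random
import threading
import time

def access_order(num_samples: int, epoch: int, seed: int = 0) -> list:
    """The shuffled order is a pure function of (seed, epoch), so it is knowable ahead of time."""
    order = list(range(num_samples))
    random.Random(f"{seed}-{epoch}").shuffle(order)
    return order

def slow_read(idx: int) -> bytes:
    time.sleep(0.01)                        # stand-in for disk/object-store latency
    return f"sample-{idx}".encode()

def prefetcher(order: list, staged: queue.Queue) -> None:
    """Background thread: read samples in the known order, ahead of the consumer."""
    for idx in order:
        staged.put((idx, slow_read(idx)))   # blocks once the lookahead buffer is full
    staged.put(None)                        # end-of-epoch sentinel

order = access_order(num_samples=32, epoch=0)
staged: queue.Queue = queue.Queue(maxsize=8)    # lookahead depth of 8 samples
threading.Thread(target=prefetcher, args=(order, staged), daemon=True).start()
while (item := staged.get()) is not None:
    idx, payload = item                     # the training step overlaps with background I/O
```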
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC, and outline the open problems that remain.
To address these open problems, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.