FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs
- URL: http://arxiv.org/abs/2309.01172v1
- Date: Sun, 3 Sep 2023 13:27:56 GMT
- Title: FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs
- Authors: Zhenheng Tang, Yuxin Wang, Xin He, Longteng Zhang, Xinglin Pan, Qiang
Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Bingsheng He, Xiaowen Chu
- Abstract summary: We envision a decentralized system that unlocks the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
- Score: 57.12856172329322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of the memory and computation requirements of
large language models (LLMs) has outpaced the development of hardware,
hindering people who lack large-scale high-end GPUs from training or
deploying LLMs. However, consumer-level GPUs, which constitute a larger
market share, are typically overlooked for LLM workloads due to their weaker
computing performance, smaller storage capacity, and lower communication
bandwidth. Additionally, users may have privacy concerns when interacting
with remote LLMs. In this paper, we envision a decentralized system that
unlocks the potential of vast untapped consumer-level GPUs for pre-training,
inference, and fine-tuning of LLMs with privacy protection. However, this
system faces critical challenges, including limited CPU and GPU memory, low
network bandwidth, peer variability, and device heterogeneity. To address
these challenges, our system design incorporates: 1) a broker with a backup
pool to support dynamic joining and quitting of computing providers; 2)
hardware-performance-aware task scheduling to improve system efficiency; 3)
abstracting ML procedures into directed acyclic graphs (DAGs) to achieve
model and task universality; 4) abstracting the intermediate representation
and execution planes to ensure compatibility across devices and deep
learning (DL) frameworks. Our performance analysis demonstrates that 50 RTX
3080 GPUs can achieve throughputs comparable to those of 4 H100 GPUs, which
are significantly more expensive.
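To make design points 1)-3) concrete, below is a minimal Python sketch of how a broker with a backup pool and a greedy hardware-performance-aware scheduler over a task DAG might fit together. This is an illustrative assumption, not the paper's implementation: all names (Provider, Task, Broker, schedule) are hypothetical, and network transfer times between providers are ignored for brevity.

```python
# Minimal sketch (hypothetical, not the paper's code) of a broker with a
# backup pool plus greedy hardware-aware scheduling over a task DAG.
from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    tflops: float             # profiled hardware performance

@dataclass
class Task:
    name: str
    tflop: float                                # estimated compute cost
    deps: list = field(default_factory=list)    # DAG edges (dependencies)

class Broker:
    """Tracks active providers; promotes a backup when one quits."""
    def __init__(self, active, backup):
        self.active, self.backup = list(active), list(backup)

    def quit(self, provider):
        self.active.remove(provider)
        if self.backup:                         # dynamic join from the pool
            self.active.append(self.backup.pop(0))

def schedule(tasks, broker):
    """Topologically order the DAG (Kahn-style sweep), then greedily place
    each task on the provider whose queue finishes it earliest. Dependency
    transfer times between providers are ignored for brevity."""
    finish = {p.name: 0.0 for p in broker.active}
    done, order = set(), []
    while len(order) < len(tasks):              # repeated ready-task sweep
        for t in tasks:
            if t.name not in done and all(d in done for d in t.deps):
                done.add(t.name)
                order.append(t)
    plan = []
    for t in order:
        p = min(broker.active,
                key=lambda q: finish[q.name] + t.tflop / q.tflops)
        finish[p.name] += t.tflop / p.tflops
        plan.append((t.name, p.name))
    return plan

if __name__ == "__main__":
    broker = Broker(
        active=[Provider("rtx3080_a", 30.0), Provider("rtx3080_b", 30.0)],
        backup=[Provider("rtx3080_c", 30.0)])
    dag = [Task("embed", 60.0),
           Task("block0", 120.0, deps=["embed"]),
           Task("block1", 120.0, deps=["embed"]),
           Task("head", 30.0, deps=["block0", "block1"])]
    broker.quit(broker.active[0])               # a provider leaves mid-run;
    print(schedule(dag, broker))                # a backup takes its place
```

In a full system, each DAG node would presumably carry an operator-level subgraph in the intermediate representation, and the scheduler would also need to account for memory limits and the network bandwidth between peers.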
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Developing MLLMs for real-world robots is challenging due to the limited computation and memory capacities typically available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the computational cost of the LLM by 5.2-6.5x and in its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- TDML -- A Trustworthy Distributed Machine Learning Framework [7.302091381583343]
The rapid advancement of large models (LM) has intensified the demand for computing resources.
This demand is exacerbated by limited availability due to supply chain delays and monopolistic acquisition by major tech firms.
We propose a trustworthy distributed machine learning (TDML) framework that leverages guidance to coordinate remote trainers and validate workloads.
arXiv Detail & Related papers (2024-07-10T03:22:28Z)
- Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [4.959530958049395]
We develop a pipeline to characterize and predict the training performance of modern machine learning (ML) workloads on compute systems.
Our pipeline generalizes to other types of ML workloads, such as Transformer-based NLP models.
It is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration.
arXiv Detail & Related papers (2024-04-19T07:20:33Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer-based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning [7.05946599544139]
High throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide bounded latency for each request to support consistent service-level objectives (SLOs).
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z)