Distributed Inference and Fine-tuning of Large Language Models Over The
Internet
- URL: http://arxiv.org/abs/2312.08361v1
- Date: Wed, 13 Dec 2023 18:52:49 GMT
- Title: Distributed Inference and Fine-tuning of Large Language Models Over The
Internet
- Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk,
Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
- Abstract summary: Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
- Score: 91.00270820533272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are useful in many NLP tasks and become more
capable with size, with the best open-source models having over 50 billion
parameters. However, using these 50B+ models requires high-end hardware, making
them inaccessible to most researchers. In this work, we investigate methods for
cost-efficient inference and fine-tuning of LLMs, comparing local and
distributed strategies. We observe that a large enough model (50B+) can run
efficiently even on geodistributed devices in a consumer-grade network. This
could allow running LLMs efficiently by pooling together idle compute resources
of multiple research groups and volunteers. We address two open problems: (1)
how to perform inference and fine-tuning reliably if any device can disconnect
abruptly and (2) how to partition LLMs between devices with uneven hardware,
joining and leaving at will. In order to do that, we develop special
fault-tolerant inference algorithms and load-balancing protocols that
automatically assign devices to maximize the total system throughput. We
showcase these algorithms in Petals - a decentralized system that runs Llama 2
(70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for
interactive generation. We evaluate the performance of our system in simulated
conditions and a real-world setup spanning two continents.
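As a concrete illustration of the client side described above, the snippet below is a minimal usage sketch assuming the publicly released `petals` Python package and its Transformers-style API; the model name and prompt are placeholders rather than values taken from the paper.

```python
# Minimal Petals client sketch (assumes the public `petals` package).
# Only the small embedding and LM-head layers run locally; the transformer
# blocks are executed by remote servers discovered through the swarm.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "meta-llama/Llama-2-70b-hf"  # placeholder: any model served by the swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("Distributed inference over the Internet", return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Contributing a GPU to the swarm is likewise a single command (roughly `python -m petals.cli.run_server <model_name>`), with the exact flags depending on the Petals version.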
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models [3.4070166891274263]
Large language models (LLMs) are used to solve natural language processing, complex reasoning, sentiment analysis, and other tasks.
These abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms.
We propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS.
We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex.
arXiv Detail & Related papers (2024-05-28T17:20:44Z)
- DiLoCo: Distributed Low-Communication Training of Language Models [32.15083548875492]
Large language models (LLMs) have been a critical component in many applications of machine learning.
Standard approaches to training LLMs require a large number of interconnected accelerators.
We propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected.
arXiv Detail & Related papers (2023-11-14T12:05:45Z)
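To make "low-communication" concrete, here is a schematic single-process sketch of the pattern DiLoCo describes: each worker takes many local optimizer steps, and only the averaged parameter delta is exchanged and applied by an outer optimizer. The function name, hyperparameters, and optimizer choices are illustrative assumptions, not the authors' implementation.

```python
# Schematic, single-process sketch of a DiLoCo-style communication round.
import copy
import torch

def diloco_round(global_model, loss_fns, shards, outer_opt, inner_steps=500):
    """Each simulated worker takes many local AdamW steps with no communication;
    the averaged parameter delta is then applied once as an "outer gradient"
    by `outer_opt` (e.g. SGD with Nesterov momentum)."""
    snapshot = [p.detach().clone() for p in global_model.parameters()]
    avg_delta = [torch.zeros_like(p) for p in snapshot]

    for loss_fn, shard in zip(loss_fns, shards):
        local = copy.deepcopy(global_model)          # worker's private replica
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
        for batch in shard[:inner_steps]:            # local phase: no network traffic
            inner_opt.zero_grad()
            loss_fn(local, batch).backward()
            inner_opt.step()
        for d, p0, p in zip(avg_delta, snapshot, local.parameters()):
            d += (p0 - p.detach()) / len(loss_fns)   # accumulate the mean delta

    # Communication happens once per round: apply the mean delta as a gradient.
    outer_opt.zero_grad()
    for p, p0, d in zip(global_model.parameters(), snapshot, avg_delta):
        p.data.copy_(p0)
        p.grad = d
    outer_opt.step()
```

A caller would build `outer_opt` over `global_model.parameters()` (for example, SGD with Nesterov momentum) and invoke `diloco_round` once per communication round.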
- RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs [32.01139974519813]
We present RedCoast, a tool crafted to automate distributed training and inference for large language models (LLMs).
We also propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions.
As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.
arXiv Detail & Related papers (2023-10-25T04:32:35Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis in massive Internet of Things (IoT) based intelligent and ubiquitous computing.
With fast-growing applications and data volumes, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share or aggregate data centrally.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Prune2Edge: A Multi-Phase Pruning Pipelines to Deep Ensemble Learning in IIoT [0.0]
We propose a novel edge-based multi-phase pruning pipeline for ensemble learning on IIoT devices.
In the first phase, we generate a diverse ensemble of pruned models; we then apply integer quantisation, and finally we prune the generated ensemble using a clustering-based technique.
Our proposed approach was able to outperform the predictability levels of a baseline model.
arXiv Detail & Related papers (2020-04-09T17:44:34Z)
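Since the summary reads like a three-phase recipe (prune an ensemble, integer-quantise it, then shrink it by clustering), here is a generic sketch assembled from standard PyTorch and scikit-learn utilities; the helper names, sparsity levels, and cluster-based selection rule are assumptions for illustration, not the paper's pipeline.

```python
# Generic prune -> quantise -> cluster-and-select sketch (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from sklearn.cluster import KMeans

def make_pruned_ensemble(base_ctor, prune_amounts=(0.3, 0.5, 0.7)):
    """Phase 1: build a diverse ensemble by pruning fresh models at different sparsities."""
    ensemble = []
    for amount in prune_amounts:
        model = base_ctor()
        for module in model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=amount)
                prune.remove(module, "weight")   # bake the sparsity into the weights
        ensemble.append(model)
    return ensemble

def quantise_ensemble(ensemble):
    """Phase 2: apply integer (dynamic int8) quantisation to every member."""
    return [torch.ao.quantization.quantize_dynamic(m, {nn.Linear}, dtype=torch.qint8)
            for m in ensemble]

def cluster_and_select(ensemble, calibration_x, n_keep=2):
    """Phase 3: cluster members by their predictions and keep one per cluster."""
    with torch.no_grad():
        preds = torch.stack([m(calibration_x).flatten() for m in ensemble])
    labels = KMeans(n_clusters=n_keep, n_init=10).fit_predict(preds.numpy())
    return [ensemble[list(labels).index(c)] for c in range(n_keep)]
```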
- LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition [33.581285906182075]
We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
Compared to the original models, LCP models achieve maximum and average speedups of 56x and 7x, which can be improved up to an average speedup of 33x.
arXiv Detail & Related papers (2020-03-13T19:52:44Z)
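As a rough illustration of the "almost-independent and narrow branches" idea, the toy module below keeps each branch self-contained so per-branch work could run on a separate device, with communication confined to a single fusion point; the dimensions and fusion head are assumptions, not the actual LCP architecture.

```python
# Toy sketch of a model built from narrow, almost-independent branches.
import torch
import torch.nn as nn

class BranchedNet(nn.Module):
    def __init__(self, in_dim=784, branch_dim=64, n_branches=4, n_classes=10):
        super().__init__()
        # Each branch is a narrow, self-contained sub-network that could be
        # placed on its own device and run without talking to the others.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_dim, branch_dim), nn.ReLU(),
                nn.Linear(branch_dim, branch_dim), nn.ReLU(),
            )
            for _ in range(n_branches)
        )
        # Communication happens only here, when branch outputs are fused.
        self.head = nn.Linear(branch_dim * n_branches, n_classes)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # independent per-branch work
        return self.head(torch.cat(feats, dim=-1))        # single, small merge point

logits = BranchedNet()(torch.randn(2, 784))
print(logits.shape)  # torch.Size([2, 10])
```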