Distributed Inference and Fine-tuning of Large Language Models Over The
Internet
- URL: http://arxiv.org/abs/2312.08361v1
- Date: Wed, 13 Dec 2023 18:52:49 GMT
- Title: Distributed Inference and Fine-tuning of Large Language Models Over The
Internet
- Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk,
Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
- Abstract summary: Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
- Score: 91.00270820533272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are useful in many NLP tasks and become more
capable with size, with the best open-source models having over 50 billion
parameters. However, using these 50B+ models requires high-end hardware, making
them inaccessible to most researchers. In this work, we investigate methods for
cost-efficient inference and fine-tuning of LLMs, comparing local and
distributed strategies. We observe that a large enough model (50B+) can run
efficiently even on geodistributed devices in a consumer-grade network. This
could allow running LLMs efficiently by pooling together idle compute resources
of multiple research groups and volunteers. We address two open problems: (1)
how to perform inference and fine-tuning reliably if any device can disconnect
abruptly and (2) how to partition LLMs between devices with uneven hardware,
joining and leaving at will. In order to do that, we develop special
fault-tolerant inference algorithms and load-balancing protocols that
automatically assign devices to maximize the total system throughput. We
showcase these algorithms in Petals - a decentralized system that runs Llama 2
(70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for
interactive generation. We evaluate the performance of our system in simulated
conditions and a real-world setup spanning two continents.
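As a concrete illustration of the client side described above, the snippet below is a minimal usage sketch assuming the publicly released `petals` Python package and its Transformers-style API; the model name and prompt are placeholders rather than values taken from the paper.

```python
# Minimal Petals client sketch (assumes the public `petals` package).
# Only the small embedding and LM-head layers run locally; the transformer
# blocks are executed by remote servers discovered through the swarm.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "meta-llama/Llama-2-70b-hf"  # placeholder: any model served by the swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("Distributed inference over the Internet", return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Contributing a GPU to the swarm is likewise a single command (roughly `python -m petals.cli.run_server <model_name>`), with the exact flags depending on the Petals version.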
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models [3.4070166891274263]
Large language models (LLMs) are used to solve natural language processing, complex reasoning, sentiment analysis, and other tasks.
These abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms.
We propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS.
We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex.
arXiv Detail & Related papers (2024-05-28T17:20:44Z)
- DiLoCo: Distributed Low-Communication Training of Language Models [32.15083548875492]
Large language models (LLMs) have been a critical component in many applications of machine learning.
Standard approaches to training LLMs require a large number of interconnected accelerators.
We propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected.
arXiv Detail & Related papers (2023-11-14T12:05:45Z)
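To make "low-communication" concrete, here is a schematic single-process sketch of the pattern DiLoCo describes: each worker takes many local optimizer steps, and only the averaged parameter delta is exchanged and applied by an outer optimizer. The function name, hyperparameters, and optimizer choices are illustrative assumptions, not the authors' implementation.

```python
# Schematic, single-process sketch of a DiLoCo-style communication round.
import copy
import torch

def diloco_round(global_model, loss_fns, shards, outer_opt, inner_steps=500):
    """Each simulated worker takes many local AdamW steps with no communication;
    the averaged parameter delta is then applied once as an "outer gradient"
    by `outer_opt` (e.g. SGD with Nesterov momentum)."""
    snapshot = [p.detach().clone() for p in global_model.parameters()]
    avg_delta = [torch.zeros_like(p) for p in snapshot]

    for loss_fn, shard in zip(loss_fns, shards):
        local = copy.deepcopy(global_model)          # worker's private replica
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
        for batch in shard[:inner_steps]:            # local phase: no network traffic
            inner_opt.zero_grad()
            loss_fn(local, batch).backward()
            inner_opt.step()
        for d, p0, p in zip(avg_delta, snapshot, local.parameters()):
            d += (p0 - p.detach()) / len(loss_fns)   # accumulate the mean delta

    # Communication happens once per round: apply the mean delta as a gradient.
    outer_opt.zero_grad()
    for p, p0, d in zip(global_model.parameters(), snapshot, avg_delta):
        p.data.copy_(p0)
        p.grad = d
    outer_opt.step()
```

A caller would build `outer_opt` over `global_model.parameters()` (for example, SGD with Nesterov momentum) and invoke `diloco_round` once per communication round.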
- RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs [32.01139974519813]
We present RedCoast, a tool crafted to automate distributed training and inference for large language models (LLMs).
We also propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions.
As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.
arXiv Detail & Related papers (2023-10-25T04:32:35Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis in massive Internet of Things (IoT) based intelligent and ubiquitous computing.
With fast-growing applications and data volumes, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share or aggregate data centrally.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Prune2Edge: A Multi-Phase Pruning Pipelines to Deep Ensemble Learning in IIoT [0.0]
We propose a novel edge-based multi-phase pruning pipeline for ensemble learning on IIoT devices.
In the first phase, we generate a diverse ensemble of pruned models; we then apply integer quantisation, and finally we prune the generated ensemble using a clustering-based technique.
Our proposed approach was able to outperform the predictability levels of a baseline model.
arXiv Detail & Related papers (2020-04-09T17:44:34Z)
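Since the summary reads like a three-phase recipe (prune an ensemble, integer-quantise it, then shrink it by clustering), here is a generic sketch assembled from standard PyTorch and scikit-learn utilities; the helper names, sparsity levels, and cluster-based selection rule are assumptions for illustration, not the paper's pipeline.

```python
# Generic prune -> quantise -> cluster-and-select sketch (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from sklearn.cluster import KMeans

def make_pruned_ensemble(base_ctor, prune_amounts=(0.3, 0.5, 0.7)):
    """Phase 1: build a diverse ensemble by pruning fresh models at different sparsities."""
    ensemble = []
    for amount in prune_amounts:
        model = base_ctor()
        for module in model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=amount)
                prune.remove(module, "weight")   # bake the sparsity into the weights
        ensemble.append(model)
    return ensemble

def quantise_ensemble(ensemble):
    """Phase 2: apply integer (dynamic int8) quantisation to every member."""
    return [torch.ao.quantization.quantize_dynamic(m, {nn.Linear}, dtype=torch.qint8)
            for m in ensemble]

def cluster_and_select(ensemble, calibration_x, n_keep=2):
    """Phase 3: cluster members by their predictions and keep one per cluster."""
    with torch.no_grad():
        preds = torch.stack([m(calibration_x).flatten() for m in ensemble])
    labels = KMeans(n_clusters=n_keep, n_init=10).fit_predict(preds.numpy())
    return [ensemble[list(labels).index(c)] for c in range(n_keep)]
```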
- LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition [33.581285906182075]
We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
Compared to the original models, LCP models achieve maximum and average speedups of 56x and 7x, which can be improved up to an average speedup of 33x.
arXiv Detail & Related papers (2020-03-13T19:52:44Z)
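As a rough illustration of the "almost-independent and narrow branches" idea, the toy module below keeps each branch self-contained so per-branch work could run on a separate device, with communication confined to a single fusion point; the dimensions and fusion head are assumptions, not the actual LCP architecture.

```python
# Toy sketch of a model built from narrow, almost-independent branches.
import torch
import torch.nn as nn

class BranchedNet(nn.Module):
    def __init__(self, in_dim=784, branch_dim=64, n_branches=4, n_classes=10):
        super().__init__()
        # Each branch is a narrow, self-contained sub-network that could be
        # placed on its own device and run without talking to the others.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_dim, branch_dim), nn.ReLU(),
                nn.Linear(branch_dim, branch_dim), nn.ReLU(),
            )
            for _ in range(n_branches)
        )
        # Communication happens only here, when branch outputs are fused.
        self.head = nn.Linear(branch_dim * n_branches, n_classes)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # independent per-branch work
        return self.head(torch.cat(feats, dim=-1))        # single, small merge point

logits = BranchedNet()(torch.randn(2, 784))
print(logits.shape)  # torch.Size([2, 10])
```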