Related papers: Photon: Federated LLM Pre-Training

Photon: Federated LLM Pre-Training

URL: http://arxiv.org/abs/2411.02908v1
Date: Tue, 05 Nov 2024 08:48:25 GMT
Title: Photon: Federated LLM Pre-Training
Authors: Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Bill Marino, Yan Gao, Dongqi Cai, Zexi Li, Wanru Zhao, Xinchi Qiu, Nicholas D. Lane,
Abstract summary: We introduce Photon, the first complete system for federated end-to-end LLM training. We show that Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training.
Score: 17.368070785118654
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly-connected GPUs if they can effectively be used for pre-training. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to centralized; and (3) Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64x-512xless. Our proposal is robust to data heterogeneity and converges twice as fast as previous methods like DiLoCo. This surprising data efficiency stems from a unique approach combining small client batch sizes with extremely high learning rates, enabled by federated averaging's robustness to hyperparameters. Photon thus represents the first economical system for global internet-wide LLM pre-training.

Related papers

CELLM: An Efficient Communication in Large Language Models Training for Federated Learning [0.0]
This thesis aims to develop efficient training methods for large language models (LLMs) in Federated Learning (FL) First, we use low-rank adaptation (LoRA) to reduce the computational load of local model training. Second, we communicate sparse updates throughout training to significantly cut down on communication costs.
arXiv Detail & Related papers (2024-07-30T05:24:08Z)
A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
The Future of Large Language Model Pre-training is Federated [15.237418036900582]
We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources for pre-training LLMs with billions of parameters. We further show the effectiveness of the federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources.
arXiv Detail & Related papers (2024-05-17T15:27:52Z)
FedMS: Federated Learning with Mixture of Sparsely Activated Foundations Models [11.362085734837217]
We propose a novel two-stage federated learning algorithm called FedMS. A global expert is trained in the first stage and a local expert is trained in the second stage to provide better personalization. We employ extensive experiments to verify the effectiveness of FedMS, results show that FedMS outperforms other SOTA baselines by up to 55.25% in default settings.
arXiv Detail & Related papers (2023-12-26T07:40:26Z)
Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems. We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems [6.8519529064678375]
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities.
arXiv Detail & Related papers (2023-10-04T13:00:53Z)
Semi-Federated Learning: Convergence Analysis and Optimization of A Hybrid Learning Framework [70.83511997272457]
We propose a semi-federated learning (SemiFL) paradigm to leverage both the base station (BS) and devices for a hybrid implementation of centralized learning (CL) and FL. We propose a two-stage algorithm to solve this intractable problem, in which we provide the closed-form solutions to the beamformers.
arXiv Detail & Related papers (2023-10-04T03:32:39Z)
FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method. We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate. We show that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z)
FedDBL: Communication and Data Efficient Federated Deep-Broad Learning for Histopathological Tissue Classification [65.7405397206767]
We propose Federated Deep-Broad Learning (FedDBL) to achieve superior classification performance with limited training samples and only one-round communication. FedDBL greatly outperforms the competitors with only one-round communication and limited training samples, while it even achieves comparable performance with the ones under multiple-round communications. Since no data or deep model sharing across different clients, the privacy issue is well-solved and the model security is guaranteed with no model inversion attack risk.
arXiv Detail & Related papers (2023-02-24T14:27:41Z)
Conquering the Communication Constraints to Enable Large Pre-Trained Models in Federated Learning [18.12162136918301]
Federated learning (FL) has emerged as a promising paradigm for enabling the collaborative training of models without centralized access to the raw data on local devices. Recent state-of-the-art pre-trained models are getting more capable but also have more parameters. Can we find a solution to enable those strong and readily-available pre-trained models in FL to achieve excellent performance while simultaneously reducing the communication burden? Specifically, we systemically evaluate the performance of FedPEFT across a variety of client stability, data distribution, and differential privacy settings.
arXiv Detail & Related papers (2022-10-04T16:08:54Z)
Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive. We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
Communication-Efficient Federated Learning with Dual-Side Low-Rank Compression [8.353152693578151]
Federated learning (FL) is a promising and powerful approach for training deep learning models without sharing the raw data of clients. We propose a new training method, referred to as federated learning with dual-side low-rank compression (FedDLR) We show that FedDLR outperforms the state-of-the-art solutions in terms of both the communication and efficiency.
arXiv Detail & Related papers (2021-04-26T09:13:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.