On Harnessing Idle Compute at the Edge for Foundation Model Training
- URL: http://arxiv.org/abs/2512.22142v1
- Date: Sat, 13 Dec 2025 20:57:43 GMT
- Title: On Harnessing Idle Compute at the Edge for Foundation Model Training
- Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina
- Abstract summary: We introduce Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.
- Score: 7.228241542082645
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The ecosystem behind foundation model development today is highly centralized and limited to large-scale cloud data center operators: training foundation models is costly, requiring immense compute resources. Decentralized foundation model training across edge devices, leveraging their spare compute, promises a democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-based training performance, exhibit limited scalability with model size, exceed device memory capacity, and incur prohibitive communication overhead. They also fail to satisfactorily handle device heterogeneity and dynamism. We introduce a new paradigm, Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Together with a parameter-server-centric training framework, Cleave copes with device memory limits and avoids communication bottlenecks, thereby enabling efficient training of large models on par with the cloud. Further, with a cost optimization model to guide device selection and training workload distribution, Cleave effectively accounts for device heterogeneity and churn. Our evaluations show that Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.
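The abstract does not spell out Cleave's partitioning scheme, so the following is only a minimal numpy sketch, under assumptions of our own, of the general ingredients it mentions: splitting one layer's tensor computation across heterogeneous edge devices in proportion to their capacity, with a parameter server gathering outputs and stitching shard gradients back together. The device capacities, the purely column-wise split, and the toy loss are illustrative choices, not the paper's selective hybrid tensor parallelism.

```python
# Hypothetical sketch (not Cleave's actual code): capacity-proportional,
# column-wise tensor parallelism for one linear layer, with a parameter
# server (PS) that concatenates partial outputs and shard gradients.
import numpy as np

rng = np.random.default_rng(0)

# Heterogeneous edge devices; capacities are arbitrary FLOP/s-like units.
device_capacity = np.array([4.0, 2.0, 1.0])
shares = device_capacity / device_capacity.sum()       # workload fractions

d_in, d_out, batch = 16, 12, 8
W = rng.normal(size=(d_in, d_out)) * 0.1                # full weight, lives on the PS

# Column-wise split: device i holds roughly d_out * share_i output columns.
col_counts = np.maximum(1, np.round(shares * d_out).astype(int))
col_counts[-1] = d_out - col_counts[:-1].sum()          # make the counts add up
bounds = np.cumsum(np.concatenate([[0], col_counts]))
shards = [W[:, bounds[i]:bounds[i + 1]] for i in range(len(device_capacity))]

x = rng.normal(size=(batch, d_in))

# Forward pass: each device computes only its slice of the output columns.
partial_outs = [x @ Wi for Wi in shards]
y = np.concatenate(partial_outs, axis=1)                # gathered at the PS

# Backward pass for a toy loss L = 0.5 * ||y||^2, so dL/dy = y.
grad_y = y
grad_shards = [x.T @ grad_y[:, bounds[i]:bounds[i + 1]]
               for i in range(len(shards))]              # each device's dL/dW_i

# Parameter server: stitch the shard gradients back and take an SGD step.
grad_W = np.concatenate(grad_shards, axis=1)
lr = 1e-2
W -= lr * grad_W

print("columns per device:", col_counts, "| loss:", 0.5 * float((y ** 2).sum()))
```

A "hybrid" scheme would mix column-wise and row-wise splits and leave some operations unpartitioned on a per-layer basis; the cost model the abstract mentions would decide which operations to split and how much work each device receives, rather than the fixed rounding used here.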
Related papers
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models.
Existing methods suffer from computational inefficiency and objective mismatches between training and inference.
We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - Data-Efficient Stream-Based Active Distillation for Scalable Edge Model Deployment [16.22309785621152]
This work explores how to select the most useful images for training to maximize model quality while keeping transmission costs low.
Our work shows that, for a similar training load (i.e., iterations), a high-confidence stream-based strategy coupled with a diversity-based approach produces a high-quality model with minimal dataset queries.
arXiv Detail & Related papers (2025-09-24T18:52:34Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation.
Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads.
We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance [67.37017498784748]
Large-scale 3D parallel training of vision-language instruction-tuning models leads to an imbalanced computation load across different devices.
We rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices.
Our method's efficacy and generalizability are further validated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z) - Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks [8.049237611207113]
We show how potential race conditions in the hierarchical partitioning (hpZ) scheme cause instability when training models with billions of parameters.
We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency.
The updated algorithm enables robust training of larger models, with a 98% improvement in throughput and model training speed and no sacrifice in convergence quality.
arXiv Detail & Related papers (2024-06-28T01:46:10Z) - Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed to exploit edge devices that asynchronously participate in the training process by actively applying for tasks; a minimal sketch of the sparsification-and-quantization idea appears after this list.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices [3.4530027457862]
Federated learning (FL) is usually performed on resource-constrained edge devices.
The FL training process should be adjusted to these constraints.
We propose a new method that enables successive freezing and training of the parameters of the FL model at devices.
arXiv Detail & Related papers (2023-05-26T15:04:06Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices [21.513786638743234]
FTPipeHD is a novel framework that trains deep learning models across heterogeneous devices.
It is shown that FTPipeHD trains 6.8x faster than the state-of-the-art method when the computing capacity of the best device is 10x greater than that of the worst.
arXiv Detail & Related papers (2021-10-06T14:00:22Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
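As noted in the TEASQ-Fed entry above, below is a minimal, hypothetical sketch (not that paper's actual algorithm, whose details are not given here) of the general idea of sparsifying and quantizing a client's gradient update before uploading it to a parameter server: keep only the top-k entries by magnitude and encode them with 8 bits each. All names and constants are illustrative.

```python
# Hypothetical sketch (not TEASQ-Fed's algorithm): top-k gradient
# sparsification followed by 8-bit linear quantization of the surviving
# values, as an edge client might compress an update before sending it
# to a parameter server.
import numpy as np

rng = np.random.default_rng(1)


def compress_update(grad: np.ndarray, keep_ratio: float = 0.05):
    """Client side: keep the largest-magnitude entries, quantize to uint8."""
    flat = grad.ravel()
    k = max(1, int(keep_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]         # indices of top-k values
    vals = flat[idx]
    lo, hi = vals.min(), vals.max()
    span = float(hi - lo)
    scale = span / 255.0 if span > 0 else 1.0             # avoid divide-by-zero
    q = np.round((vals - lo) / scale).astype(np.uint8)    # 8-bit codes
    return idx, q, lo, scale, grad.shape


def decompress_update(idx, q, lo, scale, shape):
    """Server side: rebuild a dense (mostly zero) gradient tensor."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = q.astype(np.float64) * scale + lo
    return flat.reshape(shape)


# Toy round: two clients push compressed updates, the server averages them.
W = np.zeros((64, 32))
client_grads = [rng.normal(size=W.shape) for _ in range(2)]

recovered = [decompress_update(*compress_update(g)) for g in client_grads]
W -= 0.1 * np.mean(recovered, axis=0)                     # SGD step on the server

sent = compress_update(client_grads[0])[1].size
print(f"values sent per client: {sent} of {W.size} ({sent / W.size:.1%})")
```

A real asynchronous FL system would additionally keep error-feedback residuals on each client and weight late (stale) updates at the server; the sketch omits both.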