Unicron: Economizing Self-Healing LLM Training at Scale
- URL: http://arxiv.org/abs/2401.00134v1
- Date: Sat, 30 Dec 2023 04:06:16 GMT
- Title: Unicron: Economizing Self-Healing LLM Training at Scale
- Authors: Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou
- Abstract summary: We introduce Unicron, a workload manager for efficient self-healing in large-scale language model training.
Unicron minimizes failure-related costs across multiple concurrent tasks within a cluster.
It demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods.
- Score: 43.59768821780751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large-scale language models is increasingly critical in various
domains, but it is hindered by frequent failures, leading to significant time
and economic costs. Current failure recovery methods in cloud-based settings
inadequately address the diverse and complex scenarios that arise, focusing
narrowly on erasing downtime for individual tasks without considering the
overall cost impact on a cluster. We introduce Unicron, a workload manager
designed for efficient self-healing in large-scale language model training.
Unicron optimizes the training process by minimizing failure-related costs
across multiple concurrent tasks within a cluster. Its key features include
in-band error detection for real-time error identification without extra
overhead, a dynamic cost-aware plan generation mechanism for optimal
reconfiguration, and an efficient transition strategy to reduce downtime during
state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates
up to a 1.9x improvement in training efficiency over state-of-the-art methods,
significantly reducing failure recovery costs and enhancing the reliability of
large-scale language model training.
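The abstract names three mechanisms (in-band error detection, dynamic cost-aware plan generation, and an efficient transition strategy) without detailing how they work. As a rough, non-authoritative sketch of the cost-aware plan generation idea only (this is not Unicron's actual algorithm or interface; the task names, throughput numbers, and helper functions below are hypothetical), a reconfiguration step after a failure could be framed as choosing the GPU allocation across concurrent tasks that minimizes lost throughput plus transition downtime:

```python
# Illustrative sketch only: not Unicron's actual algorithm or API.
# It mimics the cost-aware plan generation idea from the abstract: after a
# failure, pick a GPU allocation across concurrent tasks that minimizes the
# total failure-related cost (lost throughput plus transition downtime).
# All names, numbers, and helpers here are hypothetical.

from dataclasses import dataclass
from itertools import product


@dataclass
class Task:
    name: str
    min_gpus: int            # smallest configuration the task can run on
    throughput: dict         # gpus -> samples/sec for known configurations
    transition_time: float   # seconds of downtime to reconfigure this task


def candidate_plans(tasks, healthy_gpus, step=8):
    """Enumerate GPU allocations (in multiples of `step`) that fit the healthy pool."""
    options = []
    for task in tasks:
        sizes = [g for g in range(task.min_gpus, healthy_gpus + 1, step)
                 if g in task.throughput]
        options.append(sizes)
    for combo in product(*options):
        if sum(combo) <= healthy_gpus:
            yield dict(zip((t.name for t in tasks), combo))


def plan_cost(tasks, plan, horizon=3600.0):
    """Cost = throughput lost vs. the best known configuration, plus downtime to transition."""
    cost = 0.0
    for task in tasks:
        best = max(task.throughput.values())
        achieved = task.throughput[plan[task.name]]
        cost += (best - achieved) * horizon   # work lost over the planning horizon
        cost += best * task.transition_time   # work lost while reconfiguring
    return cost


def choose_plan(tasks, healthy_gpus):
    return min(candidate_plans(tasks, healthy_gpus),
               key=lambda p: plan_cost(tasks, p))


if __name__ == "__main__":
    tasks = [
        Task("llm-7b",  min_gpus=8,  throughput={8: 90.0, 16: 170.0, 24: 240.0}, transition_time=60.0),
        Task("llm-13b", min_gpus=16, throughput={16: 80.0, 24: 115.0, 32: 150.0}, transition_time=90.0),
    ]
    # Suppose 8 of 48 GPUs failed, leaving 40 healthy.
    print(choose_plan(tasks, healthy_gpus=40))
```

The point this sketch tries to capture is that the objective is cluster-wide: the chosen plan trades off downtime and degraded throughput across all running tasks, rather than restoring a single failed task as quickly as possible.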
Related papers
- One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments [43.107261545706415]
Large Language Models (LLMs) have advanced rapidly but face significant memory demands.
Current methods typically require lengthy training to alleviate the performance degradation from quantization loss.
We make an initial attempt to extend the once-for-all framework to large language models.
arXiv Detail & Related papers (2024-05-30T16:05:15Z)
- Federated Learning based on Pruning and Recovery [0.0]
This framework integrates asynchronous learning algorithms and pruning techniques.
It addresses the inefficiencies of traditional federated learning algorithms in scenarios involving heterogeneous devices.
It also tackles the staleness issue and inadequate training of certain clients in asynchronous algorithms.
arXiv Detail & Related papers (2024-03-16T14:35:03Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z)
- Serverless Federated AUPRC Optimization for Multi-Party Collaborative Imbalanced Data Mining [119.89373423433804]
The Area Under the Precision-Recall curve (AUPRC) was introduced as an effective metric.
Serverless multi-party collaborative training can cut down the communications cost by avoiding the server node bottleneck.
We propose a new ServerLess biAsed sTochastic gradiEnt (SLATE) algorithm to directly optimize the AUPRC.
arXiv Detail & Related papers (2023-08-06T06:51:32Z)
- Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation [23.018715954992352]
We present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations.
Specifically, our framework assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph.
In experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline.
arXiv Detail & Related papers (2023-05-17T01:49:44Z)
- Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training [1.047192732651018]
We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with 5% error.
arXiv Detail & Related papers (2023-03-12T13:42:39Z)
- Time-sensitive Learning for Heterogeneous Federated Edge Intelligence [52.83633954857744]
We investigate real-time machine learning in a federated edge intelligence (FEI) system.
FEI systems exhibit heterogeneous communication and computational resource distribution.
We propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model.
arXiv Detail & Related papers (2023-01-26T08:13:22Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Efficient Device Scheduling with Multi-Job Federated Learning [64.21733164243781]
We propose a novel multi-job Federated Learning framework to enable the parallel training process of multiple jobs.
We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost.
Our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
arXiv Detail & Related papers (2021-12-11T08:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.