Unicron: Economizing Self-Healing LLM Training at Scale
- URL: http://arxiv.org/abs/2401.00134v1
- Date: Sat, 30 Dec 2023 04:06:16 GMT
- Title: Unicron: Economizing Self-Healing LLM Training at Scale
- Authors: Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou
- Abstract summary: We introduce Unicron, a workload manager for efficient self-healing in large-scale language model training.
Unicron minimizes failure-related costs across multiple concurrent tasks within a cluster.
It demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods.
- Score: 43.59768821780751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large-scale language models is increasingly critical in various
domains, but it is hindered by frequent failures, leading to significant time
and economic costs. Current failure recovery methods in cloud-based settings
inadequately address the diverse and complex scenarios that arise, focusing
narrowly on erasing downtime for individual tasks without considering the
overall cost impact on a cluster. We introduce Unicron, a workload manager
designed for efficient self-healing in large-scale language model training.
Unicron optimizes the training process by minimizing failure-related costs
across multiple concurrent tasks within a cluster. Its key features include
in-band error detection for real-time error identification without extra
overhead, a dynamic cost-aware plan generation mechanism for optimal
reconfiguration, and an efficient transition strategy to reduce downtime during
state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates
up to a 1.9x improvement in training efficiency over state-of-the-art methods,
significantly reducing failure recovery costs and enhancing the reliability of
large-scale language model training.
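The abstract names three mechanisms (in-band error detection, dynamic cost-aware plan generation, and an efficient transition strategy) without detailing how they work. As a rough, non-authoritative sketch of the cost-aware plan generation idea only (this is not Unicron's actual algorithm or interface; the task names, throughput numbers, and helper functions below are hypothetical), a reconfiguration step after a failure could be framed as choosing the GPU allocation across concurrent tasks that minimizes lost throughput plus transition downtime:

```python
# Illustrative sketch only: not Unicron's actual algorithm or API.
# It mimics the cost-aware plan generation idea from the abstract: after a
# failure, pick a GPU allocation across concurrent tasks that minimizes the
# total failure-related cost (lost throughput plus transition downtime).
# All names, numbers, and helpers here are hypothetical.

from dataclasses import dataclass
from itertools import product


@dataclass
class Task:
    name: str
    min_gpus: int            # smallest configuration the task can run on
    throughput: dict         # gpus -> samples/sec for known configurations
    transition_time: float   # seconds of downtime to reconfigure this task


def candidate_plans(tasks, healthy_gpus, step=8):
    """Enumerate GPU allocations (in multiples of `step`) that fit the healthy pool."""
    options = []
    for task in tasks:
        sizes = [g for g in range(task.min_gpus, healthy_gpus + 1, step)
                 if g in task.throughput]
        options.append(sizes)
    for combo in product(*options):
        if sum(combo) <= healthy_gpus:
            yield dict(zip((t.name for t in tasks), combo))


def plan_cost(tasks, plan, horizon=3600.0):
    """Cost = throughput lost vs. the best known configuration, plus downtime to transition."""
    cost = 0.0
    for task in tasks:
        best = max(task.throughput.values())
        achieved = task.throughput[plan[task.name]]
        cost += (best - achieved) * horizon   # work lost over the planning horizon
        cost += best * task.transition_time   # work lost while reconfiguring
    return cost


def choose_plan(tasks, healthy_gpus):
    return min(candidate_plans(tasks, healthy_gpus),
               key=lambda p: plan_cost(tasks, p))


if __name__ == "__main__":
    tasks = [
        Task("llm-7b",  min_gpus=8,  throughput={8: 90.0, 16: 170.0, 24: 240.0}, transition_time=60.0),
        Task("llm-13b", min_gpus=16, throughput={16: 80.0, 24: 115.0, 32: 150.0}, transition_time=90.0),
    ]
    # Suppose 8 of 48 GPUs failed, leaving 40 healthy.
    print(choose_plan(tasks, healthy_gpus=40))
```

The point this sketch tries to capture is that the objective is cluster-wide: the chosen plan trades off downtime and degraded throughput across all running tasks, rather than restoring a single failed task as quickly as possible.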
Related papers
- One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments [43.107261545706415]
Large Language Models (LLMs) have advanced rapidly but face significant memory demands.
Current methods typically require lengthy training to alleviate the performance degradation from quantization loss.
We make an initial attempt to extend the once-for-all framework to large language models.
arXiv Detail & Related papers (2024-05-30T16:05:15Z)
- Federated Learning based on Pruning and Recovery [0.0]
This framework integrates asynchronous learning algorithms and pruning techniques.
It addresses the inefficiencies of traditional federated learning algorithms in scenarios involving heterogeneous devices.
It also tackles the staleness issue and inadequate training of certain clients in asynchronous algorithms.
arXiv Detail & Related papers (2024-03-16T14:35:03Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z)
- Serverless Federated AUPRC Optimization for Multi-Party Collaborative Imbalanced Data Mining [119.89373423433804]
The Area Under the Precision-Recall curve (AUPRC) was introduced as an effective metric.
Serverless multi-party collaborative training can cut down the communications cost by avoiding the server node bottleneck.
We propose a new ServerLess biAsed sTochastic gradiEnt (SLATE) algorithm to directly optimize the AUPRC.
arXiv Detail & Related papers (2023-08-06T06:51:32Z)
- Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation [23.018715954992352]
We present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations.
Specifically, our framework assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph.
In experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline.
arXiv Detail & Related papers (2023-05-17T01:49:44Z)
- Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training [1.047192732651018]
We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with 5% error.
arXiv Detail & Related papers (2023-03-12T13:42:39Z)
- Time-sensitive Learning for Heterogeneous Federated Edge Intelligence [52.83633954857744]
We investigate real-time machine learning in a federated edge intelligence (FEI) system.
FEI systems exhibit heterogeneous communication and computational resource distribution.
We propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model.
arXiv Detail & Related papers (2023-01-26T08:13:22Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Efficient Device Scheduling with Multi-Job Federated Learning [64.21733164243781]
We propose a novel multi-job Federated Learning framework to enable the parallel training process of multiple jobs.
We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost.
Our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
arXiv Detail & Related papers (2021-12-11T08:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.