TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
- URL: http://arxiv.org/abs/2310.10046v3
- Date: Wed, 18 Oct 2023 15:42:59 GMT
- Title: TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
- Authors: Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo,
Tieyao Xiang, Yuheng Chen, Shigang Li
- Abstract summary: Large language models (LLMs) with hundreds of billions or trillions of parameters, exemplified by ChatGPT, have had a profound impact on various fields.
Training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months.
To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system.
- Score: 7.831906758749453
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) with hundreds of billions or trillions of
parameters, exemplified by ChatGPT, have had a profound impact on various
fields. However, training LLMs at such a super-large scale requires large
high-performance GPU clusters and long training periods lasting for months. Due
to the inevitable hardware and software failures in large-scale clusters,
maintaining uninterrupted and long-duration training is extremely challenging.
As a result, a substantial amount of training time is devoted to checkpoint
saving and loading, task rescheduling and restarting, and manual anomaly
checks, which greatly harms the overall training efficiency. To address
these issues, we propose TRANSOM, a novel fault-tolerant LLM training system.
In this work, we design three key subsystems: the training pipeline automatic
fault tolerance and recovery mechanism named Transom Operator and Launcher
(TOL), the training task multi-dimensional metric automatic anomaly detection
system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous
access automatic fault tolerance and recovery technology named Transom
Checkpoint Engine (TCE). TOL manages the lifecycle of training tasks, while TEE
is responsible for task monitoring and anomaly reporting. When TEE detects a
training anomaly, it reports it to TOL, which automatically applies its fault
tolerance strategy to eliminate abnormal nodes and restart the training task.
The asynchronous checkpoint saving and loading provided by TCE greatly shortens
the fault tolerance overhead. The experimental results
indicate that TRANSOM significantly enhances the efficiency of large-scale LLM
training on clusters. Specifically, the pre-training time for GPT3-175B has
been reduced by 28%, while checkpoint saving and loading performance has
improved by a factor of 20.
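The abstract's main technical claim is that asynchronous checkpoint access (TCE) hides most of the checkpoint I/O behind ongoing computation. Below is a minimal sketch of that general pattern in PyTorch, not TRANSOM's actual implementation: the training loop takes a quick CPU snapshot of the model state, and a background thread persists it to disk while GPU training continues. The class and method names (`AsyncCheckpointer`, `async_save`) are illustrative assumptions.

```python
import threading
import torch


class AsyncCheckpointer:
    """Minimal sketch of asynchronous checkpoint saving (illustrative only,
    not TRANSOM's actual TCE). The blocking part is a quick device-to-CPU
    copy; the slow disk write happens on a background thread."""

    def __init__(self):
        self._writer = None  # thread handling the in-flight save, if any

    def async_save(self, model, step, path):
        self.wait()  # never keep two snapshots in flight at once
        snapshot = {
            "step": step,
            # Copy parameters to CPU memory so the writer thread sees a
            # consistent view even while training keeps mutating the model.
            "model": {k: v.detach().cpu().clone()
                      for k, v in model.state_dict().items()},
            # A production system must snapshot optimizer and RNG state too.
        }
        self._writer = threading.Thread(target=torch.save, args=(snapshot, path))
        self._writer.start()

    def wait(self):
        if self._writer is not None:
            self._writer.join()
            self._writer = None


# Hypothetical usage inside a training loop:
#   ckpt = AsyncCheckpointer()
#   for step, batch in enumerate(loader):
#       loss = train_step(model, batch)
#       if step % save_every == 0:
#           ckpt.async_save(model, step, f"ckpt_{step}.pt")
#   ckpt.wait()  # ensure the final checkpoint is fully written
```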
Related papers
- Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks? [74.88417042125985]
We investigate various data-driven strategies that offer supervision data at different quality levels upon tasks of varying complexity.
We find that even when the outcome error rate for hard task supervision is high, training on such data can outperform perfectly correct supervision on easier subtasks.
Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield notable performance improvements.
arXiv Detail & Related papers (2024-10-27T17:55:27Z)
- MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning [17.437573206368494]
Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks.
Current algorithms suffer from low sample efficiency, limiting their practical applicability.
We present MENTOR, a method that improves both the architecture and optimization of RL agents.
arXiv Detail & Related papers (2024-10-19T04:31:54Z)
- Light-Weight Fault Tolerant Attention for Large Language Model Training [14.178223242134166]
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks.
LLMs are susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs.
We propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs.
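The summary names ABFT but not ATTNChecker's concrete scheme. As background, classic ABFT for a matrix product C = A·B carries checksums alongside the computation, so a corrupted entry of C can be detected by comparing its row and column sums against checksums derived from A and B. A minimal, detection-only sketch of that general idea in NumPy (illustrative, not ATTNChecker's algorithm):

```python
import numpy as np


def abft_check(A, B, C, rel_tol=1e-6):
    """Detection-only ABFT sketch (illustrative, not ATTNChecker's scheme).

    For a correct product C = A @ B, the column sums of C equal
    (column sums of A) @ B and the row sums of C equal A @ (row sums of B).
    A soft error that corrupts an entry of C breaks at least one identity.
    """
    scale = np.abs(C).sum() + 1e-12  # rough magnitude for a relative tolerance
    col_ok = np.allclose(C.sum(axis=0), A.sum(axis=0) @ B, atol=rel_tol * scale)
    row_ok = np.allclose(C.sum(axis=1), A @ B.sum(axis=1), atol=rel_tol * scale)
    return col_ok and row_ok


# Toy usage on an attention-score-like product, with an injected fault.
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16))
K_T = rng.standard_normal((16, 8))
scores = Q @ K_T
print(abft_check(Q, K_T, scores))   # True: the clean product passes
scores[3, 5] += 100.0               # simulate a corrupted entry
print(abft_check(Q, K_T, scores))   # False: the corruption is detected
```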
arXiv Detail & Related papers (2024-10-15T15:52:45Z)
- Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems [13.880001659156926]
Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities.
Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures.
We introduce a novel metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems.
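The summary does not state TOR's formula. One natural reading of an overhead-ratio style reliability metric is the fraction of total wall-clock time that goes into useful training rather than failure handling (checkpointing, restarts, idle waiting); the snippet below computes such a ratio under that assumption only, so consult the cited paper for the exact definition.

```python
def training_overhead_ratio(useful_time_h, total_wallclock_h):
    """Hypothetical reading of a TOR-style reliability metric: the fraction of
    total wall-clock time spent on useful (failure-free) training. This is an
    assumption for illustration; the cited paper gives the exact definition."""
    return useful_time_h / total_wallclock_h


# Example: a 30-day run in which 6 days were lost to failures, checkpoint I/O,
# rescheduling, and restarts.
print(training_overhead_ratio(useful_time_h=24 * 24, total_wallclock_h=30 * 24))  # 0.8
```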
arXiv Detail & Related papers (2024-08-14T11:55:28Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol [11.161913989794257]
Distributed Machine Learning (DML) systems are utilized to enhance the speed of model training in data centers (DCs) and edge nodes.
The Parameter Server (PS) communication architecture faces severe long-tail latency caused by many-to-one "incast" traffic patterns, negatively impacting training throughput.
The proposed Loss-tolerant Transmission Protocol allows partial loss of gradients during synchronization to avoid unneeded retransmissions.
Early Close adjusts the loss-tolerance threshold based on network conditions.
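A minimal sketch of the general idea of loss-tolerant gradient synchronization, assuming a simple threshold rule (this is not the paper's actual protocol): the aggregator averages whatever gradient packets arrive and proceeds without retransmission as long as the missing fraction stays within the tolerance.

```python
import numpy as np


def loss_tolerant_aggregate(chunks, num_workers, loss_tolerance=0.1):
    """Sketch of loss-tolerant gradient aggregation (illustrative, not the
    paper's protocol). `chunks` maps worker id -> gradient vector for the
    workers whose packets arrived in time. If the missing fraction is within
    the tolerance, average what arrived and skip retransmission; otherwise
    signal that a retransmission round is needed."""
    received = len(chunks)
    missing_fraction = 1.0 - received / num_workers
    if missing_fraction > loss_tolerance:
        return None, False                      # too much loss: request retransmission
    avg_grad = sum(chunks.values()) / received  # average over the gradients that arrived
    return avg_grad, True


# Toy usage: 8 workers, one packet lost; a 15% tolerance accepts the partial set.
rng = np.random.default_rng(0)
grads = {w: rng.standard_normal(4) for w in range(8) if w != 5}
g, accepted = loss_tolerant_aggregate(grads, num_workers=8, loss_tolerance=0.15)
print(accepted, g)
```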
arXiv Detail & Related papers (2023-05-07T14:01:52Z)
- Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, and it gradually increases the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
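A minimal sketch of the mechanism as summarized above, with illustrative layer sizes and a linear schedule (an interpretation, not the authors' released code): the router weights are frozen at their random initialization, and the top-k expert count grows with training progress.

```python
import torch
import torch.nn as nn


class RandomRouterMoE(nn.Module):
    """Sketch of an MoE layer with a frozen random router and a growing
    top-k schedule (an illustrative reading of SMoE-Dropout)."""

    def __init__(self, d_model=64, d_hidden=128, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.router.weight.requires_grad_(False)  # randomly initialized and kept fixed

    def active_experts(self, progress):
        # Grow the activated-expert count linearly with training progress in [0, 1].
        return max(1, int(round(progress * len(self.experts))))

    def forward(self, x, progress):
        k = self.active_experts(progress)
        scores = torch.softmax(self.router(x), dim=-1)      # frozen random routing scores
        weights, idx = torch.topk(scores, k, dim=-1)         # pick the k highest-scored experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for j in range(k):                                    # combine the selected experts per token
            for e in range(len(self.experts)):
                mask = (idx[:, j] == e)
                if mask.any():
                    out[mask] += weights[mask, j:j + 1] * self.experts[e](x[mask])
        return out


# Hypothetical usage: y = RandomRouterMoE()(torch.randn(4, 64), progress=step / total_steps)
```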
arXiv Detail & Related papers (2023-03-02T22:12:51Z)
- ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning [59.08197876733052]
Auxiliary-Task Learning (ATL) aims to improve the performance of the target task by leveraging the knowledge obtained from related tasks.
Sometimes, learning multiple tasks simultaneously results in lower accuracy than learning only the target task, a phenomenon known as negative transfer.
ForkMerge is a novel approach that periodically forks the model into multiple branches and automatically searches over varying task weights.
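A minimal sketch of the merge step under stated assumptions (an illustrative reading of the summary, not the authors' implementation): two forked branches are recombined by searching an interpolation coefficient on a grid and keeping the mixture with the lowest target-task validation loss.

```python
import numpy as np


def merge_branches(branch_params, target_val_loss, grid=np.linspace(0.0, 1.0, 11)):
    """Sketch of the merge step for two forked branches: grid-search an
    interpolation weight and keep the combination with the lowest
    target-task validation loss (illustrative, not ForkMerge's exact code)."""
    w_a, w_b = branch_params
    best_loss, best_lam = min((target_val_loss(lam * w_a + (1 - lam) * w_b), lam)
                              for lam in grid)
    return best_lam * w_a + (1 - best_lam) * w_b, best_lam, best_loss


# Toy usage: two branches of a linear model, validation loss on the target task.
rng = np.random.default_rng(0)
X_val = rng.standard_normal((64, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y_val = X_val @ w_true


def target_val_loss(w):
    return float(np.mean((X_val @ w - y_val) ** 2))


w_target_only = w_true + 0.3 * rng.standard_normal(4)  # branch trained on the target task only
w_joint = w_true + 0.3 * rng.standard_normal(4)        # branch trained jointly with auxiliary tasks
w_merged, lam, loss = merge_branches((w_target_only, w_joint), target_val_loss)
print(lam, loss)
```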
arXiv Detail & Related papers (2023-01-30T02:27:02Z)
- M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [95.41238363769892]
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
Current MTL regimes have to activate nearly the entire model even to just execute a single task.
We present a model-accelerator co-design framework to enable efficient on-device MTL.
arXiv Detail & Related papers (2022-10-26T15:40:24Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in its early stage and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.