Robust LLM Training Infrastructure at ByteDance
- URL: http://arxiv.org/abs/2509.16293v4
- Date: Mon, 20 Oct 2025 09:35:27 GMT
- Title: Robust LLM Training Infrastructure at ByteDance
- Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
- Abstract summary: ByteRobust is a large-scale GPU infrastructure management system tailored for robust and stable training of large language models (LLMs). It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. ByteRobust is deployed on a production GPU platform and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
- Score: 21.53715636383753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelism settings and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
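ETTR here refers to the effective training time ratio: the fraction of scheduled wall-clock time spent making actual training progress rather than lost to failures, restarts, or rollback to the last checkpoint. A minimal sketch of the metric (the function signature and downtime representation are illustrative assumptions, not ByteRobust's implementation):

```python
from typing import List, Tuple

def ettr(job_start: float, job_end: float,
         downtime_windows: List[Tuple[float, float]]) -> float:
    """Effective Training Time Ratio: fraction of wall-clock time spent
    making forward progress, i.e., not lost to failures, diagnosis,
    restarts, or re-doing work after rolling back to a checkpoint."""
    total = job_end - job_start
    lost = sum(end - start for start, end in downtime_windows)
    return (total - lost) / total

# Example: a 90-day job that loses about 2.2 days to failures and rollback
# achieves roughly 97% ETTR, in line with the figure reported above.
day = 24 * 3600.0
print(ettr(0.0, 90 * day, [(10 * day, 10 * day + 2.2 * day)]))  # ~0.976
```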
Related papers
- Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better [24.03797089794804]
We propose a Late-to-Early Training (LET) paradigm that enables Large Language Models to learn later knowledge in earlier steps and earlier layers. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. Our method achieves up to a 1.6$\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training.
arXiv Detail & Related papers (2026-02-05T07:19:34Z) - AmorLIP: Efficient Language-Image Pretraining via Amortization [52.533088120633785]
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. We propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks.
arXiv Detail & Related papers (2025-05-25T05:30:37Z) - LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms [31.576014566773697]
Large Language Models (LLMs) have brought about revolutionary changes in diverse fields. This paper proposes the utilization of underlying network flow data to reconstruct the training timelines of jobs. We design LLMPrism, the first black-box performance diagnosis system for LLM training platforms.
arXiv Detail & Related papers (2025-05-01T06:38:52Z) - SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training [60.9776082805359]
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to training instability. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. We propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware clipping.
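A minimal sketch of what a spike-aware Adam step with momentum reset could look like; the spike test, `spike_factor`, and running-norm update below are assumptions for illustration and not SPAM's published update rule:

```python
import math

def spike_aware_adam_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
                          eps=1e-8, spike_factor=10.0):
    """One Adam-like step that (1) detects a gradient spike by comparing the
    gradient norm to a running estimate, (2) resets the first/second moments
    on a spike ("momentum reset"), and (3) rescales the spiking gradient
    back toward the typical magnitude ("spike-aware clipping")."""
    g_norm = math.sqrt(sum(g * g for g in grads))
    avg = state.setdefault("avg_norm", g_norm)
    spike = g_norm > spike_factor * avg

    if spike:
        state["m"] = [0.0] * len(grads)          # discard stale momentum
        state["v"] = [0.0] * len(grads)
        scale = (spike_factor * avg) / (g_norm + eps)
        grads = [g * scale for g in grads]       # clip the spike

    m = state.setdefault("m", [0.0] * len(grads))
    v = state.setdefault("v", [0.0] * len(grads))
    state["t"] = t = state.get("t", 0) + 1
    b1, b2 = betas
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    # Update the running norm, excluding the spike itself.
    state["avg_norm"] = 0.99 * avg + 0.01 * min(g_norm, spike_factor * avg)
    return params
```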
arXiv Detail & Related papers (2025-01-12T15:21:22Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs [30.034205048718885]
Training large language models (LLMs) at this scale brings unprecedented challenges to training efficiency and stability.
We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design.
We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers.
arXiv Detail & Related papers (2024-02-23T22:10:59Z) - vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training [3.0051215935332505]
This paper presents our profiling-driven simulator called vTrain to determine an efficient and cost-effective training system configuration.
We demonstrate vTrain's practicality through several case studies, e.g., effectively evaluating optimal training parallelization strategies.
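For intuition about what such a simulator estimates, a back-of-the-envelope cost model based on the common 6·N·D FLOPs rule of thumb for dense transformer pretraining; the peak-FLOPs and MFU values are assumed, and vTrain's actual model is far more detailed (parallelization, communication, memory):

```python
def estimate_training_days(num_params: float, num_tokens: float,
                           num_gpus: int, peak_flops_per_gpu: float = 989e12,
                           mfu: float = 0.40) -> float:
    """Rough end-to-end training time estimate: total FLOPs ~= 6 * N * D,
    divided by the cluster's sustained throughput (peak FLOPs * MFU).
    A crude sketch, not a substitute for a profiling-driven simulator."""
    total_flops = 6.0 * num_params * num_tokens
    effective_flops = num_gpus * peak_flops_per_gpu * mfu
    return total_flops / effective_flops / 86400.0  # seconds -> days

# e.g. a 70B-parameter model on 2T tokens with 9,600 H100-class GPUs
print(round(estimate_training_days(70e9, 2e12, 9600), 1))  # ~2.6 days
```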
arXiv Detail & Related papers (2023-11-27T13:35:15Z) - TRANSOM: An Efficient Fault-Tolerant System for Training LLMs [7.831906758749453]
Large language models (LLMs) with hundreds of billions or trillions of parameters, exemplified by ChatGPT, have had a profound impact on various fields.
Training LLMs at such parameter scales requires large high-performance GPU clusters and long training periods lasting for months.
To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system.
arXiv Detail & Related papers (2023-10-16T04:06:52Z) - Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free approach to fine-tuning sparse large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the great potential of sparsity to LLMs.
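As a rough illustration of training-free mask refinement, the toy sketch below greedily swaps pruned and kept weights of a single linear layer using a weight-magnitude times activation-norm importance proxy; this criterion is an assumption for illustration and is not DSnoT's actual reconstruction-error rule:

```python
import numpy as np

def refine_mask(W: np.ndarray, mask: np.ndarray, X: np.ndarray,
                n_passes: int = 10) -> np.ndarray:
    """Training-free refinement of a per-row sparsity mask: repeatedly swap
    a low-importance kept weight for a high-importance pruned weight, using
    |weight| * input-activation-norm as a proxy for each weight's effect on
    the layer's output reconstruction. Nonzero count per row stays fixed;
    no gradients or weight updates are involved."""
    act_norm = np.linalg.norm(X, axis=0)    # per-input-feature activation scale
    score = np.abs(W) * act_norm            # importance proxy (Wanda-style)
    mask = mask.copy()
    for _ in range(n_passes):
        for row in range(W.shape[0]):
            pruned = np.where(mask[row] == 0)[0]
            kept = np.where(mask[row] == 1)[0]
            if len(pruned) == 0 or len(kept) == 0:
                continue
            grow = pruned[np.argmax(score[row, pruned])]   # best pruned weight
            drop = kept[np.argmin(score[row, kept])]       # worst kept weight
            if score[row, grow] > score[row, drop]:
                mask[row, grow], mask[row, drop] = 1, 0
    return mask
```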
arXiv Detail & Related papers (2023-10-13T07:38:52Z) - GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length [65.24730341801468]
This paper introduces a novel, simple, and effective method named "GrowLength" to accelerate the pretraining process of Large Language Models.
Our method progressively increases the training length throughout the pretraining phase, thereby mitigating computational costs and enhancing efficiency.
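For concreteness, a hypothetical linear schedule that grows the training sequence length over pretraining steps; the specific lengths and step counts are invented for the example and are not the paper's schedule:

```python
def sequence_length_schedule(step: int, total_steps: int,
                             min_len: int = 1024, max_len: int = 8192) -> int:
    """Progressively grow the training sequence length over pretraining, so
    early steps use short (cheap) sequences and later steps use the full
    context length. A toy linear schedule for illustration only."""
    frac = min(step / max(total_steps, 1), 1.0)
    length = int(min_len + frac * (max_len - min_len))
    return max(8, (length // 8) * 8)  # round down to a multiple of 8

# Lengths at the start, middle, and end of a 100k-step run
print([sequence_length_schedule(s, 100_000) for s in (0, 50_000, 100_000)])
# -> [1024, 4608, 8192]
```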
arXiv Detail & Related papers (2023-10-01T05:25:24Z)