LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms
- URL: http://arxiv.org/abs/2505.00342v1
- Date: Thu, 01 May 2025 06:38:52 GMT
- Title: LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms
- Authors: Zhihan Jiang, Rui Ren, Guangba Yu, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu
- Abstract summary: Large Language Models (LLMs) have brought about revolutionary changes in diverse fields. This paper proposes the utilization of underlying network flow data to reconstruct the training timelines of jobs. We design LLMPrism, the first black-box performance diagnosis system for LLM training platforms.
- Score: 31.576014566773697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have brought about revolutionary changes in diverse fields, rendering LLM training of utmost importance for modern enterprises. To meet this demand, multi-tenant large-scale LLM training platforms have been built to offer LLM training services. Nevertheless, due to the complexity and synchronous nature of the LLM training process, performance issues occur frequently and can result in substantial resource wastage. The limited visibility from the perspective of platform providers impedes existing profiling methods and poses challenges to the monitoring and diagnosis of the performance of LLM training jobs. For the first time, this paper proposes the utilization of underlying network flow data to reconstruct the training timelines of jobs based on the distinct characteristics of the LLM training procedure. We design LLMPrism, the first black-box performance diagnosis system for LLM training platforms. By progressively recognizing LLM training jobs, identifying their parallelism strategies, and reconstructing the training timelines, LLMPrism achieves non-intrusive, lightweight, and continuous monitoring of LLM training systems. Leveraging this monitoring capability, it further effectively diagnoses potential performance issues. Since Oct. 2024, LLMPrism has been deployed on our large-scale production Platform-X, where evaluations and deployment experience demonstrate that LLMPrism achieves accurate timeline reconstruction with an error within 0.3% and effectively diagnoses various performance issues.
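To make the black-box idea concrete, here is a toy sketch (not the paper's actual algorithm, and all names are hypothetical) of one ingredient timeline reconstruction relies on: since LLM training is synchronous, collective-communication traffic arrives in periodic bursts, so the iteration period of a job can be estimated purely from the timestamps of those bursts in network flow data.

```python
"""Toy sketch: estimate a training job's iteration period from the
timestamps of synchronized traffic bursts (e.g. all-reduce flows),
using only black-box network flow observations."""

def infer_iteration_period(burst_times, tolerance=0.05):
    """Estimate the dominant inter-burst interval in seconds.

    burst_times: sorted timestamps at which large synchronized
    flows were observed on the fabric.
    """
    if len(burst_times) < 2:
        return None
    gaps = sorted(b - a for a, b in zip(burst_times, burst_times[1:]))
    median = gaps[len(gaps) // 2]  # robust to occasional stragglers
    # average only the gaps close to the median, discarding outliers
    close = [g for g in gaps if abs(g - median) <= tolerance * median]
    return sum(close) / len(close)

# Synthetic job: one collective burst every 2.0 s, with small jitter.
times = [i * 2.0 + (0.01 if i % 3 == 0 else -0.01) for i in range(50)]
period = infer_iteration_period(times)
print(round(period, 2))  # → 2.0
```

A real system would additionally have to separate interleaved jobs and distinguish parallelism dimensions (data/tensor/pipeline), which this sketch deliberately omits.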
Related papers
- Teaming LLMs to Detect and Mitigate Hallucinations [0.0]
We show that extending single-model consistency methods can result in substantial further improvements in hallucination detection and mitigation capabilities. We evaluate this "consortium consistency" approach across many model teams drawn from a pool of 15 models.
arXiv Detail & Related papers (2025-10-22T12:03:43Z)
- Robust LLM Training Infrastructure at ByteDance [21.53715636383753]
ByteRobust is a large-scale GPU infrastructure management system tailored for robust and stable training of large language models (LLMs). It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. ByteRobust is deployed on a production GPU platform and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
arXiv Detail & Related papers (2025-09-19T15:08:33Z)
- Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking [61.61356842567952]
We propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps. Experiments demonstrate that our method improves agent performance across three representative tasks.
arXiv Detail & Related papers (2025-05-26T14:11:12Z)
- L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis [33.245458231704546]
We present the first empirical study on the failure reports of 428 LLM training failures in our production Platform-X between May 2023 and April 2024. Our study reveals that hardware and user faults are the predominant root causes, and current diagnosis processes rely heavily on training logs. We introduce our log-based large-scale LLM training failure diagnosis framework, L4, which can automatically extract failure-indicating information from training logs.
arXiv Detail & Related papers (2025-03-26T06:09:55Z)
- Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search [2.1637240640145343]
Large language models (LLMs) have demonstrated remarkable capacity across a variety of tasks. To improve LLMs' reasoning ability, process supervision has proven to be better than outcome supervision. In this work, we study using Monte Carlo Tree Search (MCTS) to generate process supervision data with LLMs themselves for training them.
arXiv Detail & Related papers (2025-01-02T12:09:17Z)
- A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs [74.35290684163718]
A primary challenge in large language model (LLM) development is their onerous pre-training cost.
This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM).
arXiv Detail & Related papers (2024-10-24T14:31:52Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- OPDAI at SemEval-2024 Task 6: Small LLMs can Accelerate Hallucination Detection with Weakly Supervised Data [1.3981625092173873]
This paper describes a unified system for hallucination detection of LLMs.
It won second prize in the model-agnostic track of SemEval-2024 Task 6.
arXiv Detail & Related papers (2024-02-20T11:01:39Z)
- Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.