Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
- URL: http://arxiv.org/abs/2509.22832v1
- Date: Fri, 26 Sep 2025 18:38:25 GMT
- Title: Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
- Authors: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang
- Abstract summary: Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. We present a framework to predict end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs. Our framework achieves low average prediction errors of 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200) for models up to 20B parameters across 128 GPUs.
- Score: 11.87842612818933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors of 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200) for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
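To make the decomposition idea concrete, the sketch below shows how per-operator latency predictors could be composed into a single training-step estimate on CPU. This is a minimal illustration under assumed cost models: the class names, the linear GEMM throughput model, the ring all-reduce formula, and all constants are illustrative assumptions, not the authors' implementation or fitted predictors.

```python
# Hypothetical sketch of operator-level training-time prediction.
# All cost models and constants below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ModelConfig:
    layers: int = 24        # transformer blocks
    hidden: int = 2048      # hidden size
    seq_len: int = 2048     # tokens per sequence
    micro_batch: int = 1    # sequences per GPU per step


@dataclass
class HardwareModel:
    """Per-operator cost models, e.g. fitted from a few on-device samples."""
    gemm_tflops: float = 150.0     # assumed sustained matmul throughput (TFLOP/s)
    allreduce_gbps: float = 100.0  # assumed effective all-reduce bandwidth (GB/s)

    def gemm_time(self, m: int, n: int, k: int) -> float:
        # 2*m*n*k FLOPs divided by sustained throughput
        return 2 * m * n * k / (self.gemm_tflops * 1e12)

    def allreduce_time(self, num_bytes: int, world_size: int) -> float:
        # ring all-reduce moves ~2*(p-1)/p of the payload
        return 2 * (world_size - 1) / world_size * num_bytes / (self.allreduce_gbps * 1e9)


def predict_step_time(cfg: ModelConfig, hw: HardwareModel, dp_size: int = 8) -> float:
    """Compose per-operator predictions into one data-parallel training-step estimate."""
    tokens = cfg.micro_batch * cfg.seq_len
    # ~12*h^2 matmul FLOPs per token per layer (QKV + output projection + MLP)
    fwd = cfg.layers * hw.gemm_time(tokens, cfg.hidden, 12 * cfg.hidden)
    bwd = 2 * fwd  # backward GEMMs roughly double the forward work
    # gradient all-reduce over ~12*h^2 fp16 parameters per layer
    grad_bytes = cfg.layers * 12 * cfg.hidden * cfg.hidden * 2
    comm = hw.allreduce_time(grad_bytes, dp_size)
    return fwd + bwd + comm


print(f"predicted step time: {predict_step_time(ModelConfig(), HardwareModel()):.3f} s")
```

In the same spirit as the abstract, a full predictor would replace these closed-form approximations with sampled, hardware-aware models per operator and extend the composition to tensor- and pipeline-parallel schedules.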
Related papers
- Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective [6.51239603014107]
Large Language Models (LLMs) have pushed training workloads beyond the limits of single-node analysis. We present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms.
arXiv Detail & Related papers (2025-09-12T16:05:07Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training [29.44470664154098]
We show that careful consideration of hardware configuration and parallelization strategy is critical for effective scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies.
arXiv Detail & Related papers (2024-11-20T06:05:11Z) - Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z) - Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [4.959530958049395]
We develop a pipeline to characterize and predict the training performance of modern machine learning (ML) workloads on compute systems.
Our pipeline generalizes to other types of ML workloads, such as Transformer-based NLP models.
It is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration.
arXiv Detail & Related papers (2024-04-19T07:20:33Z) - Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models. This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments. This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.