Forecasting GPU Performance for Deep Learning Training and Inference
- URL: http://arxiv.org/abs/2407.13853v2
- Date: Fri, 15 Nov 2024 22:30:38 GMT
- Title: Forecasting GPU Performance for Deep Learning Training and Inference
- Authors: Seonho Lee, Amar Phanishayee, Divya Mahajan
- Abstract summary: NeuSight is a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution.
NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU.
It reduces the percentage error from 198% (training) and 19.7% (inference) to 3.8% in predicting the latency of the GPT3 model on the H100, compared to state-of-the-art prior work.
- Abstract: Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends or multilayer perceptrons to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. It reduces the percentage error from 198% (training) and 19.7% (inference) to 3.8% in predicting the latency of the GPT3 model on the H100, compared to state-of-the-art prior work, where neither GPT3 nor the H100 was used to train the framework.
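To make the tile decomposition concrete, the sketch below illustrates the bounding-and-aggregation structure in Python. The GPU figures are nominal H100-class numbers, and the fixed utilization factor is a stand-in for NeuSight's learned per-tile predictor; this is an illustration of the idea, not the framework's implementation.

```python
# Minimal sketch of tile-level latency prediction bounded by a roofline-style
# performance law. GPU specs and the utilization factor are illustrative
# placeholders, not NeuSight's actual learned predictor.
from dataclasses import dataclass

@dataclass
class GPUSpec:
    peak_flops: float       # FLOP/s
    mem_bandwidth: float    # bytes/s
    num_sms: int            # streaming multiprocessors

def tile_latency_lower_bound(flops: float, bytes_moved: float, gpu: GPUSpec) -> float:
    """Roofline bound: a tile can finish no faster than its compute time
    or its memory-transfer time, whichever dominates."""
    return max(flops / gpu.peak_flops, bytes_moved / gpu.mem_bandwidth)

def predict_kernel_latency(tiles, gpu: GPUSpec, utilization: float = 0.7) -> float:
    """Decompose a kernel into tiles, predict each independently, and
    aggregate over waves of tiles that run concurrently across SMs."""
    per_tile = [tile_latency_lower_bound(f, b, gpu) / utilization for f, b in tiles]
    waves = [per_tile[i:i + gpu.num_sms] for i in range(0, len(per_tile), gpu.num_sms)]
    # Each wave finishes when its slowest tile does; waves run back to back.
    return sum(max(wave) for wave in waves)

# Example: a GEMM-like kernel split into 256 identical tiles on an H100-like GPU.
h100 = GPUSpec(peak_flops=989e12, mem_bandwidth=3.35e12, num_sms=132)
tiles = [(2 * 128**3, 3 * 128 * 128 * 2)] * 256  # (FLOPs, bytes) per tile
print(f"predicted latency: {predict_kernel_latency(tiles, h100) * 1e6:.2f} us")
```

The roofline bound keeps each per-tile estimate from ever being more optimistic than the hardware's peak compute or memory bandwidth allows, which is what prevents predictions from diverging on unseen GPUs.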
Related papers
- SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation [0.0]
Large language models (LLMs) have become a significant workload since their emergence.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are as fully utilized as possible.
arXiv Detail & Related papers (2024-03-25T15:26:50Z) - PockEngine: Sparse and Efficient Fine-tuning in a Pocket [62.955793932377524]
We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices.
PockEngine supports sparse backpropagation and sparsely updates the model with measured memory saving and latency reduction.
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9x faster than PyTorch.
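The sparse-update idea can be sketched in a few lines of PyTorch: freeze most of the network and backpropagate only into a chosen parameter subset. The subset below (final layer plus all biases) is a hypothetical choice for illustration; PockEngine selects its subset and prunes the backward graph at compile time.

```python
# Illustrative sketch of sparse fine-tuning: gradients are computed and applied
# only for a chosen subset of parameters, shrinking memory use and latency.
# The layer subset below is a hypothetical example, not PockEngine's schedule.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Freeze everything, then re-enable gradients for the last layer and all biases.
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)
for name, p in model.named_parameters():
    if name.endswith("bias"):
        p.requires_grad_(True)

# The optimizer only sees trainable parameters, so optimizer state shrinks too.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.SGD(trainable, lr=1e-2)

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # autograd skips fully frozen subgraphs where possible
opt.step()
opt.zero_grad()
```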
arXiv Detail & Related papers (2023-10-26T19:46:11Z) - Benchmarking GPUs on SVBRDF Extractor Model [0.0]
In this work, we differentiate the performance of different GPUs on neural network models that operate on larger input images (256x256).
arXiv Detail & Related papers (2023-10-19T17:09:06Z) - DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
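For illustration, a minimal fixed-fanout neighborhood sampler over a plain adjacency list might look like the sketch below; libraries such as DGL and PyTorch Geometric provide performance-engineered equivalents of this loop.

```python
# Minimal sketch of fixed-fanout neighborhood sampling for mini-batch GNN
# training. Instead of expanding full neighborhoods, each node contributes at
# most `fanout` sampled neighbors per GNN layer.
import random

def sample_neighborhood(adj: dict, seeds: list, fanouts: list) -> list:
    """Return one frontier per GNN layer, grown by sampled neighbors."""
    frontiers = [list(seeds)]
    current = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for v in current:
            nbrs = adj.get(v, [])
            nxt.update(random.sample(nbrs, min(fanout, len(nbrs))))
        frontiers.append(sorted(nxt))
        current = nxt
    return frontiers

# Toy graph: node -> neighbor list; sample a 2-layer frontier around node 0.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5, 6], 3: [0], 4: [1], 5: [2], 6: [2]}
print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))
```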
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach [1.5857983167543392]
We present a new practical technique to help users make informed and cost-efficient GPU selections.
We make predictions by scaling the execution time of each operation in a training iteration from one GPU to another using either (i) wave scaling, a technique based on a GPU's execution model, or (ii) pre-trained multilayer perceptrons.
We implement our technique in a Python library called Surfer and find that it makes accurate iteration execution time predictions on ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN.
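A simplified rendition of the wave-scaling idea, under the assumption that a kernel's runtime splits into a compute-bound and a memory-bound fraction, is sketched below; the paper's actual formula models occupancy and overlap more carefully, and the GPU figures here are nominal V100/A100 values.

```python
# Simplified sketch of wave scaling: scale a kernel's measured time on GPU A
# to GPU B using ratios of SM count, clock rate, and memory bandwidth, blended
# by how memory-bound the kernel is. Illustration only; the paper's formula
# differs in details.
from dataclasses import dataclass
import math

@dataclass
class GPU:
    sms: int          # number of streaming multiprocessors
    clock_ghz: float  # SM clock
    bw_gbps: float    # memory bandwidth, GB/s

def wave_scale(time_a: float, blocks: int, mem_bound_frac: float,
               a: GPU, b: GPU) -> float:
    """Predict runtime on GPU b from a measurement on GPU a.

    blocks: thread blocks launched by the kernel; a "wave" is one round of
    blocks occupying all SMs, so fewer SMs means more waves.
    mem_bound_frac: 0.0 = fully compute-bound, 1.0 = fully memory-bound.
    """
    waves_a = math.ceil(blocks / a.sms)
    waves_b = math.ceil(blocks / b.sms)
    compute_scale = a.clock_ghz / b.clock_ghz
    memory_scale = a.bw_gbps / b.bw_gbps
    per_wave = (1 - mem_bound_frac) * compute_scale + mem_bound_frac * memory_scale
    return time_a * (waves_b / waves_a) * per_wave

# Example: scale a 2.0 ms kernel from a V100-like GPU to an A100-like GPU.
v100 = GPU(sms=80, clock_ghz=1.38, bw_gbps=900)
a100 = GPU(sms=108, clock_ghz=1.41, bw_gbps=1555)
print(f"{wave_scale(time_a=2.0, blocks=4320, mem_bound_frac=0.6, a=v100, b=a100):.3f} ms")
```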
arXiv Detail & Related papers (2021-01-31T20:17:46Z) - Understanding Training Efficiency of Deep Learning Recommendation Models at Scale [8.731263641794897]
This paper explains the intricacies of using GPUs for training recommendation models, the factors affecting hardware efficiency at scale, and learnings from a new scale-up GPU server design, Zion.
arXiv Detail & Related papers (2020-11-11T01:21:43Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
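The flavor of such a solver can be sketched as Nyström-approximate kernel ridge regression in PyTorch, which runs on a GPU when one is available. This is a toy version: large-scale solvers add preconditioning, conjugate-gradient iterations, and careful GPU memory management on top of this basic structure.

```python
# Toy sketch of a GPU-friendly large-scale kernel solver: Nyström approximation
# plus a direct linear solve, written in PyTorch so it runs on a GPU if present.
import torch

def rbf(a, b, gamma=0.5):
    # Gaussian kernel between row-sets a and b.
    return torch.exp(-gamma * torch.cdist(a, b).pow(2))

def nystrom_krr(x, y, m=100, lam=1e-3):
    """Fit kernel ridge regression with m inducing points (Nystrom)."""
    idx = torch.randperm(x.shape[0])[:m]
    centers = x[idx]
    knm = rbf(x, centers)          # (n, m) cross-kernel
    kmm = rbf(centers, centers)    # (m, m) inducing-point kernel
    # Solve (K_nm^T K_nm + lam * n * K_mm) alpha = K_nm^T y
    A = knm.T @ knm + lam * x.shape[0] * kmm
    alpha = torch.linalg.solve(A, knm.T @ y)
    return centers, alpha

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(5000, 10, device=device)
y = torch.sin(x.sum(dim=1, keepdim=True))
centers, alpha = nystrom_krr(x, y)
pred = rbf(x[:5], centers) @ alpha   # predict on the first five points
print(pred.flatten())
```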
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)