CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2501.08071v1
- Date: Tue, 14 Jan 2025 12:36:18 GMT
- Title: CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning
- Authors: Guoliang He, Eiko Yoneki,
- Abstract summary: In this work, we employ an automatic approach to optimize GPU SASS schedules.
The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling.
Experiments show that CuAsmRL can further improve the performance of existing kernels by up $26%$, and on average $9%$.
- Score: 0.0
- License:
- Abstract: Large language models (LLMs) are remarked by their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize the utilization of GPUs as much as possible. However, those specialized kernels may still leave performance on the table as CUDA assembly experts show that manual optimization of GPU SASS schedules can lead to better performance, and trial-and-error is largely employed to manually find the best GPU SASS schedules. In this work, we employ an automatic approach to optimize GPU SASS schedules, which thus can be integrated into existing compiler frameworks. The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling. To this end, we formulate an assembly game, where RL agents can play to find the best GPU SASS schedules. The assembly game starts from a \textit{-O3} optimized SASS schedule, and the RL agents can iteratively apply actions to mutate the current schedules. Positive rewards are generated if the mutated schedules get higher throughput by executing on GPUs. Experiments show that CuAsmRL can further improve the performance of existing specialized CUDA kernels transparently by up to $26\%$, and on average $9\%$. Moreover, it is used as a tool to reveal potential optimization moves learned automatically.
Related papers
- ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning [54.08906841213777]
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning.
We propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of computation times.
We show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times.
arXiv Detail & Related papers (2025-02-02T12:22:26Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient Scals have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt [65.25603275491544]
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS)
Our method is 30% faster than the original 3DGS while obtaining the same reconstruction quality optimization.
arXiv Detail & Related papers (2024-09-19T16:31:44Z) - SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation [0.0]
Large language models (LLMs) have become a significant workload since their appearance.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compilergenerated ones, so that hardware resources are as fully utilized as possible.
arXiv Detail & Related papers (2024-03-25T15:26:50Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor
Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z) - Deep Learning Models on CPUs: A Methodology for Efficient Training [1.7150798380270715]
This paper makes several contributions to research on training deep learning models using CPUs.
It presents a method for optimizing the training of deep learning models on Intel CPUs and a toolkit called ProfileDNN.
arXiv Detail & Related papers (2022-06-20T22:42:14Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters [10.395955671683245]
We propose ONES, an ONline Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
arXiv Detail & Related papers (2021-08-08T14:20:05Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.