LS-CAT: A Large-Scale CUDA AutoTuning Dataset
- URL: http://arxiv.org/abs/2103.14409v1
- Date: Fri, 26 Mar 2021 11:33:48 GMT
- Title: LS-CAT: A Large-Scale CUDA AutoTuning Dataset
- Authors: Lars Bjertnes, Jacob O. Tørring, Anne C. Elster
- Abstract summary: We present how we build the LS-CAT (Large-Scale CUDA AutoTuning) dataset from GitHub.
Our dataset includes 19 683 kernels focused on linear algebra.
The runtimes are GPU benchmarks on both Nvidia GTX 980 and Nvidia T4 systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The effectiveness of Machine Learning (ML) methods depends on access to large,
suitable datasets. In this article, we present how we build the LS-CAT
(Large-Scale CUDA AutoTuning) dataset, sourced from GitHub, for the purpose of
training NLP-based ML models. Our dataset includes 19 683 CUDA kernels focused
on linear algebra. In addition to the CUDA code, our LS-CAT dataset contains
5 028 536 associated runtimes, measured for different combinations of kernels,
block sizes and matrix sizes. The runtimes are GPU benchmarks on both Nvidia
GTX 980 and Nvidia T4 systems. This information creates a foundation upon which
NLP-based models can find correlations between source-code features and the
optimal choice of thread block size.
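The abstract does not reproduce the benchmark harness itself, so the following is only a hedged sketch of how one batch of runtime entries could be produced: a trivial, assumed CUDA kernel (scale, a made-up stand-in for an LS-CAT kernel) timed with CUDA events across a sweep of thread block sizes. All names and sizes are illustrative, not taken from the dataset.

    // Hedged sketch, not LS-CAT code: time one assumed kernel over several
    // thread block sizes, producing (block size, runtime) pairs like the
    // dataset's entries. Kernel and problem size are illustrative only.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) a[i] *= s;  // trivial linear-algebra-style operation
    }

    int main() {
        const int n = 1 << 20;
        float *d_a;
        cudaMalloc(&d_a, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Sweep candidate block sizes, as the dataset does per kernel.
        for (int block = 32; block <= 1024; block *= 2) {
            int grid = (n + block - 1) / block;  // cover all n elements
            cudaEventRecord(start);
            scale<<<grid, block>>>(d_a, 2.0f, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block=%4d  runtime=%.3f ms\n", block, ms);  // one entry
        }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_a);
        return 0;
    }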
Several results can be drawn from our LS-CAT database. For example,
our experimental results show that an optimal choice of thread block size
yields an average performance gain of 6% over the average case. We also
analyze how much performance increase can be achieved in general, finding
that in 10% of the cases a gain of more than 20% can be achieved by using
the optimal block size. A description of current and future work is also included.
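To make the 6% and 20% figures concrete, here is a minimal, hypothetical sketch (same CUDA/C++ setting, with invented runtimes) of the comparison behind them: given one kernel's measured runtime per block size, pick the fastest block and report its relative gain over a fixed default choice.

    // Hedged sketch with made-up numbers: relative gain of the optimal
    // thread block size over a fixed default, for one kernel.
    #include <cstdio>
    #include <map>

    int main() {
        // block size -> measured runtime in ms (illustrative values only)
        std::map<int, float> runtimes = {
            {128, 1.30f}, {256, 1.10f}, {512, 1.18f}, {1024, 1.42f}};
        const int default_block = 1024;  // an assumed fixed default

        int best_block = 0;
        float best_ms = 1e30f;
        for (const auto &[block, ms] : runtimes)
            if (ms < best_ms) { best_ms = ms; best_block = block; }

        float default_ms = runtimes.at(default_block);
        // Gain of the optimal block relative to the default choice.
        printf("best block = %d, gain over default = %.1f%%\n",
               best_block, 100.0f * (default_ms - best_ms) / default_ms);
        return 0;
    }

Aggregating such per-kernel gains across the database is what yields statements like "in 10% of cases the optimal block gives more than a 20% gain."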
Related papers
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference.
arXiv Detail & Related papers (2024-03-26T11:38:39Z) - SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation [0.0]
Large language models (LLMs) have become a significant workload since their appearance.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible.
arXiv Detail & Related papers (2024-03-25T15:26:50Z) - CUDA: Convolution-based Unlearnable Datasets [77.70422525613084]
Large-scale training of modern deep learning models heavily relies on publicly available data on the web.
Recent works aim to make data unlearnable for deep learning models by adding small, specially designed noises.
These methods are vulnerable to adversarial training (AT) and/or are computationally heavy.
arXiv Detail & Related papers (2023-03-07T22:57:23Z) - BB-ML: Basic Block Performance Prediction using Machine Learning
Techniques [0.6020800302423842]
We propose to use Machine Learning (ML) techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level.
We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes.
We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets.
arXiv Detail & Related papers (2022-02-16T00:19:15Z) - Accelerating Genetic Programming using GPUs [0.0]
Genetic Programming (GP) has multiple applications in machine learning, such as curve fitting, data modelling, feature selection, and classification.
This paper describes a GPU accelerated stack-based variant of the generational GP algorithm which can be used for symbolic regression and binary classification.
arXiv Detail & Related papers (2021-10-15T06:13:01Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions that achieve comparable error bounds, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - OSLNet: Deep Small-Sample Classification with an Orthogonal Softmax
Layer [77.90012156266324]
This paper aims to find a subspace of neural networks that can facilitate a large decision margin.
We propose the Orthogonal Softmax Layer (OSL), which makes the weight vectors in the classification layer remain orthogonal during both the training and test processes.
Experimental results demonstrate that the proposed OSL has better performance than the methods used for comparison on four small-sample benchmark datasets.
arXiv Detail & Related papers (2020-04-20T02:41:01Z) - Omni-Scale CNNs: a simple and effective kernel size configuration for
time series classification [47.423272376757204]
The Receptive Field (RF) size has been one of the most important factors for One Dimensional Convolutional Neural Networks (1D-CNNs) on time series classification tasks.
We propose an Omni-Scale block (OS-block) for 1D-CNNs, where the kernel sizes are decided by a simple and universal rule.
Experimental results show that models with the OS-block can achieve performance similar to models with the searched optimal RF size.
arXiv Detail & Related papers (2020-02-24T03:33:58Z) - MOGPTK: The Multi-Output Gaussian Process Toolkit [71.08576457371433]
We present MOGPTK, a Python package for multi-channel data modelling using Gaussian processes (GP).
The aim of this toolkit is to make multi-output GP (MOGP) models accessible to researchers, data scientists, and practitioners alike.
arXiv Detail & Related papers (2020-02-09T23:34:49Z)