Optimizing Sparse Linear Algebra Through Automatic Format Selection and
Machine Learning
- URL: http://arxiv.org/abs/2303.05098v1
- Date: Thu, 9 Mar 2023 08:17:26 GMT
- Title: Optimizing Sparse Linear Algebra Through Automatic Format Selection and
Machine Learning
- Authors: Christodoulos Stylianou, Michele Weiland
- Abstract summary: We introduce Morpheus-Oracle, a library that provides a lightweight ML auto-tuner capable of accurately predicting the optimal format across multiple backends.
We achieve an average classification accuracy and balanced accuracy of 92.63% and 80.22% respectively across the available systems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse matrices are an integral part of scientific simulations. As hardware
evolves new sparse matrix storage formats are proposed aiming to exploit
optimizations specific to the new hardware. In the era of heterogeneous
computing, users often are required to use multiple formats for their
applications to remain optimal across the different available hardware,
resulting in larger development times and maintenance overhead. A potential
solution to this problem is the use of a lightweight auto-tuner driven by
Machine Learning (ML) that would select for the user an optimal format from a
pool of available formats that will match the characteristics of the sparsity
pattern, target hardware and operation to execute.
In this paper, we introduce Morpheus-Oracle, a library that provides a
lightweight ML auto-tuner capable of accurately predicting the optimal format
across multiple backends, targeting the major HPC architectures aiming to
eliminate any format selection input by the end-user. From more than 2000
real-life matrices, we achieve an average classification accuracy and balanced
accuracy of 92.63% and 80.22% respectively across the available systems. The
adoption of the auto-tuner results in average speedup of 1.1x on CPUs and 1.5x
to 8x on NVIDIA and AMD GPUs, with maximum speedups reaching up to 7x and 1000x
respectively.
Related papers
- Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration [0.0]
Low-Rank GEMM is a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity.<n>System automatically adapts to hardware capabilities, selecting optimal decomposition methods.
arXiv Detail & Related papers (2025-11-24T01:13:52Z) - APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance.<n>This is primarily due to the limited support for the GPU Cores, inefficient memory management, and inflexible kernel optimizations.<n>We propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z) - FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models [49.397861654088636]
We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces.<n>We show that our strategy achieves faster runtime and reduced memory usage by up to $25%$ across different model sizes.
arXiv Detail & Related papers (2025-05-23T14:37:00Z) - Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression [5.995843028932167]
We introduce a dynamic-programming-based approach to explore more of the search space by decomposing large program specifications into smaller specifications.<n>To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $Z_geq 0$ and compresses identical, adjacent solutions.
arXiv Detail & Related papers (2025-05-03T00:14:31Z) - Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving [12.068287973463786]
Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources.
Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption.
Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two.
arXiv Detail & Related papers (2025-04-17T14:45:03Z) - Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [56.58170370127227]
We show that optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes.
This work is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers.
arXiv Detail & Related papers (2025-03-06T18:58:29Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation.
Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads.
We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single- GPU and multi- GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication [0.8363939984237685]
Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in scientific computing, graph analytics, and deep learning.
Traditional hardware accelerators are tailored for specific sparsity patterns with fixed dataflow schemes.
This paper presents a machine learning based approach for adaptively selecting the most appropriate dataflow scheme for SpGEMM tasks.
arXiv Detail & Related papers (2024-06-14T16:36:35Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zerothorder (MeZO) to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z) - ParaGraph: Weighted Graph Representation for Performance Optimization of
HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is indeed effective and has normalized RMSE as low as 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z) - Machine Learning-Driven Adaptive OpenMP For Portable Performance on
Heterogeneous Systems [1.885335997132172]
Adapting a program to a new heterogeneous platform is laborious and requires developers to manually explore a vast space of execution parameters.
This paper proposes extensions to OpenMP for autonomous, machine learning-driven adaptation.
Our solution includes a set of novel language constructs, compiler transformations, and runtime support.
arXiv Detail & Related papers (2023-03-15T18:37:18Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate
Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - On Extending Amdahl's law to Learn Computer Performance [0.0]
The problem of learning parallel computer performance is investigated in the context of multicore processors.
We propose to extend Amdahl's law to accommodate multiple resources into the overall speedup equation.
We transform the speedup equation into a multivariable regression problem suitable for machine learning.
arXiv Detail & Related papers (2021-10-15T02:37:07Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.