Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems
- URL: http://arxiv.org/abs/2406.19621v1
- Date: Fri, 28 Jun 2024 03:07:53 GMT
- Title: Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems
- Authors: Yufan Xia, Giuseppe Maria Junior Barca
- Abstract summary: We present an extension to the Architecture and Data-Structure Aware Linear Algebra library that uses machine learning to optimize the runtime of all BLAS Level 3 operations.
We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations.
We achieve speedups of 1.5 to 3.0 times for all operations, compared to using the maximum number of threads.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.
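The core idea (choosing the thread count per BLAS call from the matrix dimensions, instead of always using every core) can be sketched as follows. This is an illustrative stand-in, not the actual ADSALA model: the `BENCH` timing table, its numbers, and the nearest-neighbour lookup in `best_threads` are all assumptions made up for the example.

```python
import math

# Hypothetical offline benchmark: measured DGEMM runtimes (seconds) for a few
# (m, n, k) problem shapes at several thread counts. In an ADSALA-style setup
# such data would be gathered per machine; these numbers are invented.
BENCH = {
    (256, 256, 256):    {1: 0.004, 4: 0.002, 16: 0.003, 64: 0.009},
    (2048, 2048, 2048): {1: 1.900, 4: 0.510, 16: 0.160, 64: 0.120},
    (8192, 8192, 8192): {1: 120.0, 4: 31.00, 16: 8.500, 64: 3.200},
}

def best_threads(m, n, k):
    """Predict a good thread count for an (m, n, k) GEMM by nearest-neighbour
    lookup in log-FLOP space over the benchmarked shapes (a toy stand-in for
    the trained ML model described in the paper)."""
    target = math.log(2.0 * m * n * k)  # approximate GEMM FLOP count
    nearest = min(
        BENCH,
        key=lambda s: abs(math.log(2.0 * s[0] * s[1] * s[2]) - target),
    )
    times = BENCH[nearest]
    return min(times, key=times.get)  # thread count with lowest measured time

# Small problems avoid oversubscription; large problems use many threads.
print(best_threads(300, 300, 300))     # few threads for a small GEMM
print(best_threads(8000, 8000, 8000))  # many threads for a large GEMM
```

The point of the sketch is the interface: the predictor runs before each call and returns a thread count, so speedups come from not oversubscribing small problems rather than from faster kernels.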
Related papers
- A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication [1.5223740593989443]
We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library.
Our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task.
Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup.
arXiv Detail & Related papers (2026-01-14T03:28:54Z) - HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models [50.31704374968706]
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding.
They typically require extremely high computational resources for training to achieve cross-modal alignment at multi-granularity levels.
We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels.
arXiv Detail & Related papers (2025-10-23T08:16:44Z) - Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation [0.0]
We evaluate the capability of existing general-purpose LLMs for BLAS code generation for CPUs.
We found that correct code can be generated in many cases even when only the routine name is given.
We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent.
arXiv Detail & Related papers (2025-07-07T06:33:59Z) - BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent [32.81416809245337]
Kernel tuning involves systematically adjusting kernel configurations to optimize system performance.
Despite recent advancements in large language models (LLMs), kernel tuning remains a critical challenge.
We propose BYOS, an LLM-powered framework that automates kernel tuning.
arXiv Detail & Related papers (2025-03-12T15:50:16Z) - Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement [16.464496913614315]
This paper introduces SCILIB-Accel, a novel tool for automatic BLAS offload.
The tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation.
SCILIB-Accel has been evaluated using multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups.
arXiv Detail & Related papers (2024-12-31T05:24:30Z) - Should AI Optimize Your Code? A Comparative Study of Current Large Language Models Versus Classical Optimizing Compilers [0.0]
Large Language Models (LLMs) raise intriguing questions about the potential for AI-driven approaches to revolutionize code optimization methodologies.
This paper presents a comparative analysis between two state-of-the-art Large Language Models, GPT-4.0 and CodeLlama-70B, and traditional optimizing compilers.
arXiv Detail & Related papers (2024-06-17T23:26:41Z) - Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
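The "compute better spent" argument can be made concrete with a simple count: a block-diagonal layer with b blocks needs roughly 1/b of the multiplications of a dense layer of the same shape. The sketch below uses plain block-diagonal structure as a stand-in (not the Monarch or Block-Train family itself), and the sizes chosen are arbitrary illustrations.

```python
def dense_matvec_cost(n):
    # An n x n dense matrix-vector product costs n * n multiplications.
    return n * n

def block_diag_matvec_cost(n, b):
    # b diagonal blocks of size (n/b) x (n/b): b * (n/b)^2 multiplications.
    assert n % b == 0, "block count must divide the dimension"
    return b * (n // b) ** 2

n, b = 4096, 8
print(dense_matvec_cost(n))          # 16777216 multiplications
print(block_diag_matvec_cost(n, b))  # 2097152 multiplications, 8x fewer
```

Under a fixed compute budget, the saved multiplications can buy a wider layer, which is the trade-off the paper explores for its structured matrix family.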
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance.
Processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
arXiv Detail & Related papers (2024-04-10T17:00:04Z) - Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models [26.2566707495948]
Large Language Models (LLMs) have seen great advance in both academia and industry.
We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes.
Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
arXiv Detail & Related papers (2023-11-07T03:25:56Z) - Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval [23.42790810694723]
We propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS).
We train several students with different architectures and arrive at different trade-offs of performance and efficiency.
Importantly, the proposed scheme allows Knowledge Distillation on large, unlabelled datasets, which yields well-performing students.
arXiv Detail & Related papers (2021-06-24T18:34:24Z) - Learning to Optimize: A Primer and A Benchmark [94.29436694770953]
Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods.
This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization.
arXiv Detail & Related papers (2021-03-23T20:46:20Z) - A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale Machine Learning aims to learn patterns from big data with comparable performance efficiently.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.