QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
- URL: http://arxiv.org/abs/2505.06302v1
- Date: Thu, 08 May 2025 02:36:21 GMT
- Title: QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
- Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
- Abstract summary: We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp). We show that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance.
- Score: 21.529815293977833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementations take at least months to produce and lack portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and to tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to a $1291 \times$ performance improvement. Even compared with human experts, QiMeng-TensorOp reaches $251 \%$ of OpenBLAS performance on RISC-V CPUs and $124 \%$ of cuBLAS performance on NVIDIA GPUs. Additionally, QiMeng-TensorOp reduces development costs by $200 \times$ compared with human experts.
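The abstract outlines a two-stage workflow: hardware-aware generation from a one-line prompt, followed by parameter tuning. Below is a minimal Python sketch of that flow under stated assumptions; `query_llm`, `benchmark`, the prompt fields, and the tile-size grid are all hypothetical stand-ins, not QiMeng-TensorOp's actual interface.

```python
# Hypothetical sketch of the workflow the abstract describes: a one-line
# user prompt is expanded with hardware details, an LLM emits a kernel
# using hardware primitives, and blocking parameters are then searched.
# `query_llm` and `benchmark` are placeholder callables, not the paper's API.
import itertools

def generate_and_tune(user_prompt: str, hw_info: dict, query_llm, benchmark):
    # Step 1: enrich the one-line prompt with hardware characteristics.
    full_prompt = (
        f"{user_prompt}\n"
        f"Target ISA: {hw_info['isa']}, vector width: {hw_info['vector_bits']} bits, "
        f"L1 cache: {hw_info['l1_kb']} KB. Emit the kernel with hardware primitives."
    )
    kernel_src = query_llm(full_prompt)  # the LLM writes the operator

    # Step 2: tune blocking parameters of the generated kernel by measurement.
    best = (float("inf"), None)
    for mb, nb, kb in itertools.product([16, 32, 64], repeat=3):
        t = benchmark(kernel_src, {"MB": mb, "NB": nb, "KB": kb})
        best = min(best, (t, (mb, nb, kb)))
    return kernel_src, best[1]
```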
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant hardware-performance overheads.
We propose a novel parallel prompt decoding that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a $2.49\times$ speedup and maintains a minimal memory overhead of just $0.0004\%$.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
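The parallel prompt decoding summarized above drafts several future tokens per forward pass using a handful of trained prompt embeddings. A toy Python sketch of that general idea follows; `model`, its `forward` and `verify` methods, and the acceptance rule are hypothetical stand-ins rather than the paper's code.

```python
# Toy illustration of parallel prompt decoding: k trained "look-ahead"
# prompt embeddings are appended to the context so one forward pass
# proposes k draft tokens, which the base model then verifies.
def parallel_decode_step(model, context, lookahead_embeds):
    logits = model.forward(context, extra_embeds=lookahead_embeds)
    # One argmax per appended prompt embedding -> k draft tokens at once.
    draft = [l.argmax() for l in logits[-len(lookahead_embeds):]]
    # Accept drafts only while they match the base model's own prediction,
    # preserving the auto-regressive output distribution.
    accepted = []
    for tok in draft:
        if model.verify(context + accepted, tok):
            accepted.append(tok)
        else:
            break
    return accepted
```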
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast, untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
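The two-step decomposition described above separates a tile-level computational core from declaratively specified outer loops. Below is a minimal Python sketch of that separation; the loop-spec dictionary and the pure-Python tile kernel are illustrative assumptions, not the framework's actual TPP interface.

```python
# Step 1: the computational core is a small tile-level primitive.
def tpp_gemm_tile(A, B, C, i0, j0, k0, T):
    for i in range(T):
        for j in range(T):
            acc = C[i0 + i][j0 + j]
            for k in range(T):
                acc += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            C[i0 + i][j0 + j] = acc

# Step 2: the logical loops around the core are stated declaratively
# (just the iteration domain) and interpreted by a tiny "runtime".
def run(spec, tile, A, B, C):
    for i0 in range(0, spec["M"], spec["T"]):
        for j0 in range(0, spec["N"], spec["T"]):
            for k0 in range(0, spec["K"], spec["T"]):
                tile(A, B, C, i0, j0, k0, spec["T"])

# Example: run({"M": 64, "N": 64, "K": 64, "T": 16}, tpp_gemm_tile, A, B, C)
```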
- oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation [8.64220475114214]
oneDNN Graph Compiler employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high performance code generation.
Experimental results demonstrate significant performance gains over existing tensor compilers and primitives libraries for performance-critical computation graphs.
arXiv Detail & Related papers (2023-01-03T19:52:17Z)
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization [22.812702519665617]
We present TensorIR, a compiler abstraction for optimizing programs with tensor computation primitives.
We build an end-to-end framework on top of our compilation to automatically optimize deep learning models for given tensor computation primitives.
arXiv Detail & Related papers (2022-07-09T16:28:57Z)
- SOL: Reducing the Maintenance Overhead for Integrating Hardware Support into AI Frameworks [0.7614628596146599]
AI frameworks such as Theano, Caffe, Chainer, CNTK, MxNet, PyTorch, and DL4J provide a high-level scripting API.
Less mainstream CPU, GPU, or accelerator vendors must expend considerable effort to get their hardware supported by these frameworks.
NEC Laboratories Europe began developing the SOL AI Optimization project years ago to address this.
arXiv Detail & Related papers (2022-05-19T08:40:46Z)
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding [14.635810503599759]
CoRa is a tensor compiler that allows users to easily generate efficient code for ragged tensor operators.
We evaluate CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model.
arXiv Detail & Related papers (2021-10-19T19:39:04Z)
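The CoRa entry above hinges on avoiding the wasted work of padding ragged batches to a common length. The small Python sketch below makes that cost concrete with hypothetical sequence lengths; it is our arithmetic illustration, not CoRa's implementation.

```python
# With full padding, every sequence is processed at the maximum length;
# a kernel that follows the ragged lengths directly skips wasted work.
lengths = [7, 12, 3, 64, 20]           # hypothetical batch of sequence lengths
max_len = max(lengths)

padded_work = len(lengths) * max_len   # dense kernel over a padded batch
ragged_work = sum(lengths)             # kernel that follows actual lengths

print(f"padded: {padded_work}, ragged: {ragged_work}, "
      f"waste: {1 - ragged_work / padded_work:.0%}")   # ~67% wasted here
```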
- Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads [86.62083829086393]
This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of Deep Learning workloads with high productivity.
TPPs define a compact, yet versatile set of 2D-tensor operators (or a virtual ISA), which can be utilized as building-blocks to construct complex operators on high-dimensional tensors.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL-workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.
arXiv Detail & Related papers (2021-04-12T18:35:49Z)
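The TPP abstraction above treats a small set of 2D tile operators as a virtual ISA from which larger operators are composed. A hedged Python sketch of that composition idea follows; the primitives and the fused operator are illustrative, not the actual TPP library.

```python
# A tiny "vocabulary" of 2D tile operators, composed into a larger operator.
def tpp_add(a, b):    # 2D elementwise add tile
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tpp_relu(a):      # 2D ReLU tile
    return [[max(0.0, x) for x in row] for row in a]

def fused_bias_relu(tiles, bias_tile):
    # A "complex" operator over a higher-dimensional tensor, expressed
    # purely as a loop over 2D tiles and tile-level primitives.
    return [tpp_relu(tpp_add(t, bias_tile)) for t in tiles]
```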
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
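The data-reuse analysis mentioned above drives tile-size choices so that a tile's working set fits in cache. The simplified Python sketch below conveys only the flavor of such a cost model under an assumed 32 KB L1 data cache; real polyhedral analysis is far more general.

```python
# For a tiled GEMM, estimate the working set of one tile and keep only
# tile shapes whose footprint fits in cache.
def working_set_bytes(mb, nb, kb, dtype_bytes=4):
    # Tiles of A (mb x kb), B (kb x nb), and C (mb x nb) live in cache.
    return (mb * kb + kb * nb + mb * nb) * dtype_bytes

l1_bytes = 32 * 1024   # hypothetical 32 KB L1 data cache
candidates = [(mb, nb, kb)
              for mb in (8, 16, 32, 64)
              for nb in (8, 16, 32, 64)
              for kb in (8, 16, 32, 64)
              if working_set_bytes(mb, nb, kb) <= l1_bytes]
print(len(candidates), "tile shapes fit in L1, e.g.", candidates[-1])
```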
- Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures [56.69373580921888]
We focus on Recommender Systems, which account for most of the AI cycles in cloud computing centers.
By enabling them to run on the latest CPU hardware and software tailored for HPC, we achieve more than a two-orders-of-magnitude improvement in performance.
arXiv Detail & Related papers (2020-05-10T14:40:16Z)
- Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction [0.0]
We propose lightweight augmented neural networks for arbitrary combinations of kernel variants and hardware.
We obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to a 1.7x speedup over Halide's auto-scheduler.
arXiv Detail & Related papers (2020-03-17T02:19:54Z)
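The entry above reports accuracy as MAPE (mean absolute percentage error) and uses the predictor to select kernel variants. A brief Python sketch of both pieces follows; the runtimes and variant names are hypothetical stand-ins, and the predictor is not the paper's augmented network.

```python
# MAPE: average of |actual - predicted| / actual, as a percentage.
def mape(actual, predicted):
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Variant selection: pick the kernel variant with the lowest *predicted* runtime.
def pick_variant(variants, predict_runtime):
    return min(variants, key=predict_runtime)

actual    = [1.20, 0.85, 2.10]      # measured runtimes (ms), hypothetical
predicted = [1.25, 0.80, 2.05]
print(f"MAPE = {mape(actual, predicted):.1f}%")   # ~4.1% on this toy data

runtimes = {"scalar": 2.0, "avx2": 0.9, "avx512": 0.7}   # predicted ms
print(pick_variant(list(runtimes), runtimes.get))        # -> avx512
```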
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)