Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)
- URL: http://arxiv.org/abs/2602.19762v1
- Date: Mon, 23 Feb 2026 12:12:39 GMT
- Title: Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)
- Authors: Mohammed Javed Absar, Muthu Baskaran, Abhikrant Sharma, Abhilash Bhandari, Ankit Aggarwal, Arun Rangasamy, Dibyendu Das, Fateme Hosseini, Franck Slama, Iulian Brumar, Jyotsna Verma, Krishnaprasad Bindumadhavan, Mitesh Kothari, Mohit Gupta, Ravishankar Kolachana, Richard Lethin, Samarth Narang, Sanjay Motilal Ladwa, Shalini Jain, Snigdha Suresh Dalvi, Tasmia Rahman, Venkat Rasagna Reddy Komatireddy, Vivek Vasudevbhai Pandya, Xiyue Shi, Zachary Zipper
- Abstract summary: Hexagon-MLIR is an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU). It provides unified support for lowering Triton kernels and PyTorch models.
- Score: 3.8043062351078585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built using the MLIR framework, our compiler applies a structured sequence of passes to exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0) for our target by providing automated compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by providing developers with an open-source MLIR-based compilation stack that gives them a path to advance AI compilation capabilities through a more flexible approach. Hexagon-MLIR is a work in progress, and we are continuing to add many more optimizations and capabilities.
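The mega-kernel idea described above, fusing producer and consumer operations so that intermediate results stay in fast local memory (the TCM) rather than round-tripping through DRAM, can be illustrated with a minimal pure-Python sketch. All names below are illustrative and are not part of the Hexagon-MLIR API:

```python
# Sketch of kernel fusion ("mega-kernels"): a library-based flow runs two
# kernels with a full intermediate buffer between them; the fused form keeps
# each element's intermediate value local (analogous to keeping data in TCM).

def relu_kernel(xs):
    # First kernel: materializes a full intermediate buffer.
    return [max(x, 0.0) for x in xs]

def scale_kernel(xs, s):
    # Second kernel: reads the intermediate buffer back from memory.
    return [x * s for x in xs]

def unfused(xs, s):
    # Library-style composition: the intermediate list is fully materialized.
    return scale_kernel(relu_kernel(xs), s)

def fused_mega_kernel(xs, s):
    # Fused form: one pass over the data, no intermediate buffer at all.
    return [max(x, 0.0) * s for x in xs]

data = [-2.0, -1.0, 0.5, 3.0]
assert unfused(data, 2.0) == fused_mega_kernel(data, 2.0)
```

The fused version performs the same arithmetic but with one traversal and no intermediate allocation, which is the bandwidth saving the abstract attributes to mega-kernels.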
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement-learning system that develops kernel expertise through three components. It delivers 100%, 100%, and 92% faster rates over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z)
- AKG Kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis [13.239454996851771]
Modern AI models demand high-performance computation kernels. The AKG Kernel Agent (AI-driven Kernel Generator) is designed to support multiple domain-specific languages. Its modular design allows rapid integration of backend DSLs and hardware targets, and the system achieves an average speedup of 1.46x over PyTorch Eager baselines.
arXiv Detail & Related papers (2025-12-29T12:42:05Z)
- Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels [37.00431889602245]
This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels. The technique is implemented in an MLIR-based compiler supporting both vector and tile-based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
arXiv Detail & Related papers (2025-11-14T14:32:28Z)
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution, and observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- eIQ Neutron: Redefining Edge-AI Inference with Integrated NPU and Compiler Innovations [4.776283807742058]
The eIQ Neutron efficient-NPU is integrated into a commercial flagship MPU. The solution achieves an average speedup of 1.8x (4x peak) at equal TOPS and memory resources across standard AI benchmarks.
arXiv Detail & Related papers (2025-09-17T19:45:51Z)
- Towards a high-performance AI compiler with upstream MLIR [34.89141656581549]
This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance.
We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch.
arXiv Detail & Related papers (2024-04-15T10:35:50Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
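The two-step decomposition above can be sketched in a few lines of Python: a small "primitive" carries the core computation on a tile, while the logical loops that drive it over the full problem are written separately so they can be tuned independently. All names here are illustrative, not the TPP library's actual API:

```python
# Step 1: the computational core as a primitive operating on one tile
# (a micro-GEMM on plain nested lists, for illustration only).
def gemm_primitive(a_tile, b_tile, c_tile):
    m, k, n = len(a_tile), len(b_tile), len(b_tile[0])
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c_tile[i][j] += a_tile[i][p] * b_tile[p][j]

# Step 2: the logical loops around the primitive, expressed separately
# from the primitive so the tiling/loop order can change without
# touching the core computation.
def tiled_matmul(a, b, tile=2):
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                a_t = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_t = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                c_t = [row[j0:j0 + tile] for row in c[i0:i0 + tile]]
                gemm_primitive(a_t, b_t, c_t)
                # write the updated output tile back
                for di, row in enumerate(c_t):
                    c[i0 + di][j0:j0 + tile] = row
    return c
```

Changing `tile` or the `i0`/`j0`/`p0` loop order only touches step 2, which is the separation of concerns the summary describes.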
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z)
- Bring Your Own Codegen to Deep Learning Compiler [8.87545486816377]
This paper proposes an open source framework that enables users to only concentrate on the development of their proprietary code generation tools.
Our framework provides users with flexible and easy-to-use interfaces to partition their models into segments that can be executed on "the best" processors.
arXiv Detail & Related papers (2021-05-03T17:22:25Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to building deep learning kernels. We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)