Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface
- URL: http://arxiv.org/abs/2504.18430v1
- Date: Fri, 25 Apr 2025 15:43:50 GMT
- Title: Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface
- Authors: Erika Hunhoff, Joseph Melber, Kristof Denolf, Andra Bisca, Samuel Bayliss, Stephen Neuendorffer, Jeff Fifield, Jack Lo, Pranathi Vasireddy, Phil James-Roxby, Eric Keller
- Abstract summary: This work aims to increase the efficiency of designers using IRON, a toolkit for close-to-metal NPU performance engineers. We provide an updated programmer interface to IRON containing new and refined programming constructs. Analysis shows a 26% average reduction in lines of code and decreases in Halstead metrics for a variety of designs.
- Score: 0.9199464917832796
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accelerators such as neural processing units (NPUs) deliver an enticing balance of performance and efficiency compared to general-purpose compute architectures. However, effectively leveraging accelerator capabilities is not always simple: low-level programming toolkits may require substantial developer effort, while high-level programming toolkits may abstract away critical optimization features. This work aims to increase the efficiency of designers using IRON, a toolkit for close-to-metal NPU performance engineers. We provide an updated programmer interface to IRON containing new and refined programming constructs. The new interface includes extensible features for placement and data transformation. These contributions are evaluated in terms of 1) efficiency, with analysis showing a ~26% average reduction in lines of code and decreases in Halstead metrics for a variety of designs; 2) expressivity, demonstrating that the new interface supports the wide range of features and patterns already supported by IRON; and 3) extensibility, illustrating that the new tooling for placement and tiling can be extended to accommodate common use-cases.
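Halstead metrics, cited above alongside lines of code, are standard complexity measures computed from operator and operand counts. A minimal sketch of the computation, assuming a simplified token classification (this is not IRON's actual evaluation pipeline):

```python
import math

def halstead_metrics(operators, operands):
    """Standard Halstead metrics from operator/operand token streams.

    How tokens are split into the two classes is tool-specific; the
    formulas below are the textbook definitions.
    """
    n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
    N1, N2 = len(operators), len(operands)            # total counts
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)           # V = N * log2(n)
    difficulty = (n1 / 2) * (N2 / max(n2, 1))         # D = (n1/2) * (N2/n2)
    effort = difficulty * volume                      # E = D * V
    return {"volume": volume, "difficulty": difficulty, "effort": effort}

# Toy example for "a = b + b": '=' and '+' are operators; a, b are operands.
print(halstead_metrics(["=", "+"], ["a", "b", "b"]))
```

Lower volume, difficulty, and effort on the same designs indicate that the interface lets programmers express the same design with fewer and simpler tokens.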
Related papers
- Hardware/Software Co-Design of RISC-V Extensions for Accelerating Sparse DNNs on FPGAs [1.4225653519332482]
We propose novel RISC-V extensions for accelerating DNN models containing semi-structured and unstructured sparsity.
Our designs consume a small amount of additional FPGA resources such that the resulting co-designs enable the acceleration of DNNs even on small FPGAs.
We benchmark our designs on standard TinyML applications such as keyword spotting, image classification, and person detection.
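As a generic illustration of semi-structured sparsity (a 2:4 pattern of the kind such hardware commonly exploits; not necessarily this paper's exact scheme), each group of four consecutive weights keeps at most two nonzeros:

```python
import numpy as np

def prune_2_of_4(weights):
    """Illustrative 2:4 semi-structured pruning: in every group of 4
    consecutive weights, keep the 2 largest magnitudes and zero the rest."""
    w = weights.reshape(-1, 4).copy()
    smallest = np.argsort(np.abs(w), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(w, smallest, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.arange(8, dtype=np.float32) - 3.5  # [-3.5, -2.5, ..., 3.5]
print(prune_2_of_4(w))                    # two nonzeros per group of four
```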
arXiv Detail & Related papers (2025-04-28T10:19:39Z)
- OTC: Optimal Tool Calls via Reinforcement Learning [87.28134636548705]
We propose a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. Our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy.
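A minimal sketch of one plausible reward shaping in this spirit, assuming a binary correctness signal discounted by the number of tool calls (the paper's exact formulation may differ):

```python
def tool_integrated_reward(correct: bool, tool_calls: int,
                           alpha: float = 0.1) -> float:
    """Couple correctness with tool efficiency: a correct answer earns
    1.0, discounted as tool calls grow; wrong answers earn nothing, so
    the policy cannot trade accuracy for fewer calls. Illustrative only."""
    if not correct:
        return 0.0
    return 1.0 / (1.0 + alpha * tool_calls)

print(tool_integrated_reward(True, 0))   # 1.0: correct with no tools
print(tool_integrated_reward(True, 10))  # 0.5: correct but tool-heavy
```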
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
- Allo: A Programming Model for Composable Accelerator Design [7.884541004161727]
We introduce Allo, a composable programming model for efficient spatial accelerator design.
Allo decouples hardware customizations, including compute, memory, communication, and data type from algorithm specification.
Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in the PolyBench suite.
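A hypothetical sketch of that decoupling: the algorithm is specified once as plain code, while compute, memory, and communication customizations are recorded on a separate schedule object. The names and methods below are illustrative, not Allo's actual API:

```python
def gemm(A, B, C, M, N, K):
    """Algorithm specification only: no hardware concerns."""
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]

class Schedule:
    """Hardware customizations kept separate from the algorithm."""
    def __init__(self, fn):
        self.fn, self.directives = fn, []
    def split(self, loop, factor):        # compute customization
        self.directives.append(("split", loop, factor))
    def buffer_at(self, tensor, loop):    # memory customization
        self.directives.append(("buffer_at", tensor, loop))
    def pipeline(self, loop):             # throughput customization
        self.directives.append(("pipeline", loop))

s = Schedule(gemm)
s.split("j", factor=8)
s.buffer_at("C", "i")
s.pipeline("k")
print(s.directives)  # composed customizations, with gemm left untouched
```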
arXiv Detail & Related papers (2024-04-07T05:47:54Z)
- Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives.
We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis.
We find MAD (mechanistic architecture design) synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
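As a toy illustration of the compute-optimal side of such an analysis (synthetic data; the paper's methodology is far more extensive), a power-law loss curve can be fit on log-log axes:

```python
import numpy as np

# Synthetic loss-vs-compute points following loss = a * C**(-b) + c.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs (synthetic)
loss = 2.0 + 50.0 * compute ** -0.05          # synthetic eval loss

c = 2.0  # assumed irreducible loss
slope, log_a = np.polyfit(np.log(compute), np.log(loss - c), 1)
print(f"fitted exponent b = {-slope:.3f}, coefficient a = {np.exp(log_a):.1f}")
```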
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
- CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs).
It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks.
Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
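A minimal sketch of the retrieval step, assuming a simple bag-of-words match between the task and tool descriptions (CRAFT's actual retrieval component is more sophisticated):

```python
import math
from collections import Counter

# Hypothetical toolset: name -> natural-language description.
TOOLSET = {
    "count_objects": "count the number of objects of a given type in an image",
    "solve_equation": "solve a symbolic math equation for a variable",
    "lookup_capital": "look up the capital city of a country",
}

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(task: str, k: int = 1):
    """Return the k tools whose descriptions best match the task."""
    ranked = sorted(TOOLSET, key=lambda n: cosine(task, TOOLSET[n]), reverse=True)
    return ranked[:k]

print(retrieve("how many dogs are in the image"))  # ['count_objects']
```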
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Chimera: A Hybrid Machine Learning Driven Multi-Objective Design Space Exploration Tool for FPGA High-Level Synthesis [11.128278223431805]
High-level synthesis (HLS) tools were created to simplify hardware designs for FPGAs.
Applying HLS optimization directives to achieve high performance is time-consuming and usually requires expert knowledge.
We present an automated design space exploration tool for applying HLS optimization directives, called Chimera.
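Exploration of this kind is inherently multi-objective; a minimal sketch of the Pareto-front filtering step such a tool performs, over synthetic latency/resource points:

```python
# Synthetic design points: each directive combination yields a
# latency/resource trade-off (numbers are made up for illustration).
designs = [
    {"directives": "pipeline II=1",      "latency": 120, "luts": 9000},
    {"directives": "pipeline II=2",      "latency": 180, "luts": 6000},
    {"directives": "unroll=4, pipeline", "latency": 150, "luts": 9500},
    {"directives": "baseline",           "latency": 400, "luts": 3000},
]

def pareto_front(points):
    """Keep only designs not dominated in both latency and LUT usage."""
    def dominates(q, p):
        return (q["latency"] <= p["latency"] and q["luts"] <= p["luts"]
                and (q["latency"] < p["latency"] or q["luts"] < p["luts"]))
    return [p for p in points if not any(dominates(q, p) for q in points)]

for d in pareto_front(designs):  # 'unroll=4, pipeline' is dominated and dropped
    print(d["directives"], d["latency"], d["luts"])
```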
arXiv Detail & Related papers (2022-07-03T21:13:55Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- A Construction Kit for Efficient Low Power Neural Network Accelerator Designs [11.807678100385164]
This work provides a survey of neural network accelerator optimization approaches that have been used in recent works.
It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately.
arXiv Detail & Related papers (2021-06-24T07:53:56Z)
- A Learned Performance Model for Tensor Processing Units [5.733911161090224]
We demonstrate a method of learning performance models from a corpus of graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks.
It helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
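A minimal sketch of the underlying idea, assuming hand-picked per-kernel features and synthetic measurements (the paper learns far richer models over whole graph programs):

```python
import numpy as np

# Features per kernel: [FLOPs (G), bytes moved (GB), op count] -- synthetic.
X = np.array([[1.0, 0.5, 30],
              [4.0, 1.0, 55],
              [8.0, 3.0, 90],
              [2.0, 2.5, 40]], dtype=float)
y = np.array([0.9, 2.8, 7.1, 2.6])  # measured runtimes in ms (synthetic)

# Fit a linear performance model with a bias term via least squares.
theta, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def predict(features):
    """Predict runtime (ms) for an unseen kernel's features."""
    return np.append(np.asarray(features, dtype=float), 1.0) @ theta

print(predict([3.0, 1.5, 50]))  # estimated runtime for a new kernel
```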
arXiv Detail & Related papers (2020-08-03T17:24:52Z)
- Towards High Performance Relativistic Electronic Structure Modelling: The EXP-T Program Package [68.8204255655161]
We present a new implementation of the FS-RCC method designed for modern parallel computers.
The performance and scaling features of the implementation are analyzed.
The software developed makes it possible to achieve a completely new level of accuracy in predicting properties of atoms and molecules containing heavy and superheavy nuclei.
arXiv Detail & Related papers (2020-04-07T20:08:30Z)
- Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction [0.0]
We propose lightweight augmented neural networks for arbitrary combinations of kernel, variant, and hardware.
We obtain a low mean absolute percentage error (MAPE) of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
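For reference, MAPE (mean absolute percentage error) is the average relative deviation of predictions from measurements, expressed in percent:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error: mean(|y - y_hat| / y) * 100."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Each prediction is off by 3% of the true value, so MAPE is 3%.
print(mape([10.0, 20.0, 40.0], [10.3, 19.4, 41.2]))  # ~3.0
```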
arXiv Detail & Related papers (2020-03-17T02:19:54Z)