eIQ Neutron: Redefining Edge-AI Inference with Integrated NPU and Compiler Innovations
- URL: http://arxiv.org/abs/2509.14388v1
- Date: Wed, 17 Sep 2025 19:45:51 GMT
- Title: eIQ Neutron: Redefining Edge-AI Inference with Integrated NPU and Compiler Innovations
- Authors: Lennart Bamberg, Filippo Minnella, Roberto Bosio, Fabrizio Ottati, Yuebin Wang, Jongmin Lee, Luciano Lavagno, Adam Fuks
- Abstract summary: The eIQ Neutron efficient-NPU is integrated into a commercial flagship MPU. Our solution achieves an average speedup of 1.8x (4x peak) at equal TOPS and memory resources across standard AI benchmarks.
- Score: 4.776283807742058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Processing Units (NPUs) are key to enabling efficient AI inference in resource-constrained edge environments. While peak tera operations per second (TOPS) is often used to gauge performance, it poorly reflects real-world throughput and correlates more closely with silicon cost. To address this, architects must focus on maximizing compute utilization without sacrificing flexibility. This paper presents the eIQ Neutron efficient-NPU, integrated into a commercial flagship MPU, alongside co-designed compiler algorithms. The architecture employs a flexible, data-driven design, while the compiler uses a constraint-programming approach to optimize compute and data movement based on workload characteristics (see the sketch below). Compared to the leading embedded NPU and compiler stack, our solution achieves an average speedup of 1.8x (4x peak) at equal TOPS and memory resources across standard AI benchmarks. Even against NPUs with double the compute and memory resources, Neutron delivers up to 3.3x higher performance.
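The paper's compiler internals are not public, so the following is only a minimal sketch of what a constraint-programming schedule for compute and data movement can look like, written with Google OR-Tools CP-SAT. Every task name, duration, and dependency here is an invented stand-in, not taken from eIQ Neutron.

```python
from ortools.sat.python import cp_model  # pip install ortools

model = cp_model.CpModel()
horizon = 100

# Hypothetical tiles of one layer: DMA transfers and MAC-array compute.
tasks = {  # name: (duration, engine)
    "dma_in_A": (4, "dma"),
    "conv_A":   (9, "mac"),
    "dma_in_B": (4, "dma"),
    "conv_B":   (9, "mac"),
    "dma_out":  (3, "dma"),
}

starts, ends, per_engine = {}, {}, {"dma": [], "mac": []}
for name, (dur, engine) in tasks.items():
    s = model.NewIntVar(0, horizon, f"s_{name}")
    e = model.NewIntVar(0, horizon, f"e_{name}")
    per_engine[engine].append(model.NewIntervalVar(s, dur, e, name))
    starts[name], ends[name] = s, e

# Each engine (DMA queue, MAC array) executes one task at a time.
model.AddNoOverlap(per_engine["dma"])
model.AddNoOverlap(per_engine["mac"])

# Data dependencies: inputs must land before compute; the output follows it.
model.Add(starts["conv_A"] >= ends["dma_in_A"])
model.Add(starts["conv_B"] >= ends["dma_in_B"])
model.Add(starts["dma_out"] >= ends["conv_B"])

# Minimizing the makespan lets the solver overlap dma_in_B with conv_A.
makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, list(ends.values()))
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for name in tasks:
        print(f"{name}: {solver.Value(starts[name])} -> {solver.Value(ends[name])}")
```

The point of the formulation is that precedence and resource constraints are stated declaratively; the solver, rather than hand-written heuristics, finds the overlap of data movement and compute that minimizes total latency.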
Related papers
- TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design [4.251127523213289]
TriGen is a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design.
We show that TriGen achieves an average 2.73x performance speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.
arXiv Detail & Related papers (2026-02-13T14:28:31Z)
- Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends [0.2578242050187029]
Edge devices typically operate under strict constraints on processing power, memory, and energy consumption.
Small Language Models (SLMs) offer lightweight alternatives that bring AI inference to resource-constrained environments.
We analyze both the maximum achievable performance and the processing and energy efficiency of commercial solutions available for each platform.
arXiv Detail & Related papers (2025-11-27T11:11:01Z)
- xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework.
xLLM builds a novel decoupled service-engine architecture.
xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
- Serving Large Language Models on Huawei CloudMatrix384 [28.88558053380112]
Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency.
This paper introduces Huawei CloudMatrix, a next-generation AI architecture realized in the production-grade CloudMatrix384 supernode.
It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources.
arXiv Detail & Related papers (2025-06-15T03:41:34Z)
- XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units [0.6063137165121326]
State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks.
XAMBA is the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs.
XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations (see the sketch after this entry).
arXiv Detail & Related papers (2025-02-10T17:33:30Z)
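A minimal, self-contained illustration of the matrix-based replacement the XAMBA summary mentions (array sizes and values here are illustrative, not XAMBA's): a cumulative sum equals a matrix multiply with a lower-triangular ones matrix, and a reduce-sum is the same trick with an all-ones vector, both of which map directly onto an NPU's MAC array.

```python
import numpy as np

def cumsum_as_matmul(x: np.ndarray) -> np.ndarray:
    """Cumulative sum along the last axis via one matrix multiply."""
    n = x.shape[-1]
    # L[i, j] = 1 for j <= i, so (x @ L.T)[..., i] = sum of x[..., :i+1].
    L = np.tril(np.ones((n, n), dtype=x.dtype))
    return x @ L.T

x = np.arange(1.0, 6.0)                       # [1, 2, 3, 4, 5]
assert np.allclose(cumsum_as_matmul(x), np.cumsum(x))

# ReduceSum is the degenerate case: a matmul with an all-ones vector.
assert np.isclose(x @ np.ones_like(x), x.sum())
```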
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE inference that surpasses existing parallelism schemes.
Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion (sketched after this entry).
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
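As a rough, hypothetical Python analogue of that two-step decomposition (the framework's actual API is not reproduced here), the computational core below is a fixed tile primitive, while the loop nest around it is plain data that can be reordered without touching the primitive.

```python
import numpy as np
from itertools import product

def gemm_tile(A, B, C, i, j, k, T):
    """Processing primitive: one T x T tile update of C += A @ B."""
    C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]

def run_declarative(loop_order, sizes, T, body, **tensors):
    """Drive the primitive from a declarative loop spec such as ("k", "i", "j")."""
    ranges = [range(0, sizes[ax], T) for ax in loop_order]
    for idx in product(*ranges):
        body(**tensors, **dict(zip(loop_order, idx)), T=T)

N, T = 128, 32
A, B = np.random.rand(N, N), np.random.rand(N, N)
C = np.zeros((N, N))
# The schedule is data, not code: swapping the tuple changes only the loop order.
run_declarative(("k", "i", "j"), {"i": N, "j": N, "k": N}, T, gemm_tile, A=A, B=B, C=C)
assert np.allclose(C, A @ B)
```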
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
Adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks (see the adapter sketch after this entry).
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
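A minimal sketch of the adapter idea behind such data reuse (dimensions, task names, and the memory mapping are illustrative assumptions, not the paper's configuration): one large frozen backbone weight is shared by every task, so it can sit in dense, slower memory, while each task adds only a small bottleneck adapter that could fit in fast on-chip SRAM.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 768, 64                                   # hidden size, adapter bottleneck
W_shared = rng.standard_normal((D, D)) * 0.02    # frozen, reused by all tasks

def make_adapter():
    # Down-project, ReLU, up-project: ~2*D*R parameters vs. D*D shared ones.
    return (rng.standard_normal((D, R)) * 0.02,
            rng.standard_normal((R, D)) * 0.02)

adapters = {"sentiment": make_adapter(), "nli": make_adapter()}  # hypothetical tasks

def forward(x, task):
    h = x @ W_shared                             # shared compute and shared weights
    down, up = adapters[task]
    return h + np.maximum(h @ down, 0.0) @ up    # tiny task-specific residual path

x = rng.standard_normal((1, D))
print(forward(x, "sentiment").shape)             # (1, 768)
```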
- Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances [58.720142291102135]
VPUNN is a neural network-based cost model trained on low-level task profiling (a toy analogue follows this entry).
It consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.
arXiv Detail & Related papers (2022-05-09T22:48:39Z)
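A toy analogue of such a learned cost model (features, data, and the linear fit are illustrative; VPUNN itself is a neural network trained on real hardware profiles): fit predicted cycles from simple op descriptors, then let the compiler query the model instead of the device.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-op descriptors: [MACs, bytes moved, output channels].
X = rng.uniform(1e3, 1e6, size=(512, 3))
# Stand-in for profiled cycle counts (hidden "true" coefficients plus noise).
y = X @ np.array([2e-3, 5e-3, 1e-2]) + rng.normal(0.0, 50.0, 512)

# Fit on profiles once; the compiler then queries the model, not the hardware.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def predict_cycles(macs: float, bytes_moved: float, out_ch: float) -> float:
    return float(np.array([macs, bytes_moved, out_ch, 1.0]) @ coef)

print(f"predicted cycles: {predict_cycles(5e5, 2e5, 128):.0f}")
```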
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) algorithm, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, the computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing loads, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices [35.90103072918056]
Deep neural networks (DNNs) have emerged as the most important and popular artificial intelligence (AI) technique.
The growth of model size poses a key energy-efficiency challenge for the underlying computing platform.
This paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models (see the sketch below).
arXiv Detail & Related papers (2020-04-23T02:26:40Z)
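A minimal sketch of the permuted-diagonal structure PermDNN builds on (block size, values, and permutation are illustrative): each p x p weight block is constrained to a permuted diagonal, so it stores only p values plus a permutation instead of p*p values, and a matrix-vector product needs only p multiplies.

```python
import numpy as np

p = 4
vals = np.array([0.5, -1.2, 0.7, 2.0])   # the p stored weights
perm = np.array([2, 0, 3, 1])            # row i has its nonzero in column perm[i]

def expand(vals, perm):
    """Materialize the dense p x p block from its compact form."""
    W = np.zeros((len(vals), len(vals)))
    W[np.arange(len(vals)), perm] = vals
    return W

def matvec(vals, perm, x):
    """y = W @ x without materializing W: one multiply per row."""
    return vals * x[perm]

x = np.arange(1.0, 5.0)
assert np.allclose(expand(vals, perm) @ x, matvec(vals, perm, x))
print(matvec(vals, perm, x))             # p multiplies instead of p*p
```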