Towards Efficient Multi-Scale Deformable Attention on NPU
- URL: http://arxiv.org/abs/2505.14022v1
- Date: Tue, 20 May 2025 07:25:23 GMT
- Title: Towards Efficient Multi-Scale Deformable Attention on NPU
- Authors: Chenghuan Huang, Zhigeng Xu, Chong Sun, Chen Li, Ziyang Ma
- Abstract summary: We present a co-design approach that rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. Our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations.
- Score: 4.661513231235483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-scale deformable attention (MSDA) is a flexible and powerful feature extraction mechanism for visual tasks, but its random-access grid sampling strategy poses significant optimization challenges, especially on domain-specific accelerators such as NPUs. In this work, we present a co-design approach that systematically rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. With this co-design approach, our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations. Extensive experiments show that our solution achieves up to $5.9\times$ (forward), $8.9\times$ (backward), and $7.3\times$ (end-to-end training) speedup over the grid sample-based baseline, and $1.9\times$, $2.4\times$, and $2.0\times$ acceleration over the latest vendor library, respectively.
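The optimization challenge is easiest to see in the grid sample-based baseline the paper benchmarks against. Below is a minimal sketch in the style of the widely used PyTorch reference implementation from Deformable DETR; the tensor names and shapes are illustrative, and this is the baseline formulation, not the authors' Ascend NPU kernel.

```python
import torch
import torch.nn.functional as F

def msda_grid_sample(value, spatial_shapes, sampling_locations, attention_weights):
    # value:              (N, sum(H*W), M, D)  flattened multi-scale feature maps
    # spatial_shapes:     list of (H, W) per feature level
    # sampling_locations: (N, Q, M, L, P, 2)   normalized to [0, 1]
    # attention_weights:  (N, Q, M, L, P)
    N, _, M, D = value.shape
    _, Q, _, L, P, _ = sampling_locations.shape
    value_list = value.split([h * w for h, w in spatial_shapes], dim=1)
    grids = 2 * sampling_locations - 1  # grid_sample expects [-1, 1]
    sampled = []
    for lvl, (h, w) in enumerate(spatial_shapes):
        # (N, h*w, M, D) -> (N*M, D, h, w)
        v = value_list[lvl].flatten(2).transpose(1, 2).reshape(N * M, D, h, w)
        # (N, Q, M, P, 2) -> (N, M, Q, P, 2) -> (N*M, Q, P, 2)
        g = grids[:, :, :, lvl].transpose(1, 2).flatten(0, 1)
        # random-access bilinear gather: the step that is hard to optimize on NPUs
        sampled.append(F.grid_sample(v, g, mode="bilinear",
                                     padding_mode="zeros", align_corners=False))
    # weighted sum over all L*P samples per query and head
    w_ = attention_weights.transpose(1, 2).reshape(N * M, 1, Q, L * P)
    out = (torch.stack(sampled, dim=-2).flatten(-2) * w_).sum(-1)
    return out.view(N, M * D, Q).transpose(1, 2).contiguous()
```

Each query gathers L x P bilinearly interpolated values per head at data-dependent locations; that scattered-read pattern is what the co-design restructures.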
Related papers
- Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable DNN Accelerator Design [4.489116569191255]
Hardware-Aware Neural Architecture Search (HW-NAS) is an efficient approach to automatically co-optimizing neural network performance and hardware energy efficiency. We propose Coflex, a novel HW-NAS framework that integrates the Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization.
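As a rough illustration of the surrogate-driven search this describes, the toy sketch below fits one Gaussian-process surrogate per objective and keeps candidates whose predictions are Pareto-optimal. sklearn's dense GP stands in for the paper's Sparse Gaussian Process, and the function name and the two objectives (latency, energy) are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def suggest_next(X_seen, y_latency, y_energy, X_pool):
    # One surrogate per objective (dense GP here; Coflex's point is that a
    # sparse GP keeps this step scalable for large design spaces).
    gp_lat = GaussianProcessRegressor().fit(X_seen, y_latency)
    gp_eng = GaussianProcessRegressor().fit(X_seen, y_energy)
    mu = np.stack([gp_lat.predict(X_pool), gp_eng.predict(X_pool)], axis=1)
    # Keep candidates whose predicted (latency, energy) pair is not dominated.
    pareto = [i for i, m in enumerate(mu)
              if not any((o <= m).all() and (o < m).any() for o in mu)]
    return X_pool[pareto[0]]  # evaluate one predicted-Pareto design next
```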
arXiv Detail & Related papers (2025-07-31T11:16:46Z)
- MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration [8.43012094714496]
This paper presents a unified framework for codifying and automating optimization strategies to deploy deep neural networks (DNNs) on resource-constrained hardware. Our novel approach addresses two key issues: (i) encoding custom optimization strategies and (ii) enabling cross-stage optimization search.
arXiv Detail & Related papers (2025-02-09T11:02:06Z)
- HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation [0.0]
HEPPO-GAE is an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation stage in Proximal Policy Optimization. Its key innovation is a strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization. The single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency.
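A rough sketch of the standardize-then-quantize idea follows; the running-statistics details, block handling, and function names are assumptions for illustration, not the paper's FPGA pipeline.

```python
import numpy as np

def standardize_and_quantize(rewards, values, eps=1e-8):
    # Dynamic reward standardization (a real pipeline would keep running stats).
    r = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Block standardization for values (here: the whole array as one block).
    v = (values - values.mean()) / (values.std() + eps)

    def quantize_u8(x):
        # 8-bit uniform quantization: map [min, max] onto {0, ..., 255}.
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / 255.0 + eps
        q = np.round((x - lo) / scale).astype(np.uint8)
        return q, scale, lo  # scale/offset needed to dequantize later

    return quantize_u8(r), quantize_u8(v)
```

Storing standardized 8-bit tensors instead of floats is what shrinks the on-chip memory traffic during the GAE stage.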
arXiv Detail & Related papers (2025-01-22T08:18:56Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z) - Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning [11.139231636612479]
We propose Pruner and MoA-Pruner to speed up tensor program tuning. Pruner is a "Draft-then-Verify" exploration mechanism that accelerates the schedule search process. MoA-Pruner introduces a Momentum online Adaptation strategy to address cross-platform online unawareness.
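As an illustration, a "Draft-then-Verify" loop can be sketched as below: a cheap proxy score drafts a short list of promising schedules, and only those are verified with the expensive cost model or hardware measurement. The candidate generator and both scoring functions are hypothetical stand-ins, not Pruner's actual components.

```python
def draft_then_verify(candidates, cheap_score, measure, k=16):
    # Draft: rank all candidate schedules with a cheap proxy score.
    drafted = sorted(candidates, key=cheap_score, reverse=True)[:k]
    # Verify: spend expensive measurements only on the drafted few.
    verified = [(measure(c), c) for c in drafted]
    return max(verified, key=lambda t: t[0])[1]  # best verified schedule
```

The speedup comes from the asymmetry: the proxy prunes most of the search space so the expensive evaluator runs k times instead of len(candidates) times.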
arXiv Detail & Related papers (2024-02-04T06:11:12Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
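A minimal sketch of the shared-backbone, multiple-prediction-head pattern this describes; the layer sizes, head count, and averaging ensemble are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiHeadEnsemble(nn.Module):
    def __init__(self, in_dim=32, hidden=64, out_dim=8, num_heads=4):
        super().__init__()
        # One shared backbone encodes the offloading state once.
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        # Multiple lightweight prediction heads form the ensemble.
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim)
                                   for _ in range(num_heads))

    def forward(self, x):
        z = self.backbone(x)
        # Ensemble the heads' predictions by averaging.
        return torch.stack([h(z) for h in self.heads]).mean(0)
```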
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
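For intuition, a double exponential moving average (DEMA) of gradients can be sketched as below, using the standard DEMA definition (dema = 2*ema - ema_of_ema); how Admeta wires this into RAdam and SGDM is more involved and not reproduced here.

```python
def dema_step(grad, state, beta=0.9):
    # state['ema'] and state['ema2'] start at zero (same shape as grad).
    state['ema'] = beta * state['ema'] + (1 - beta) * grad            # EMA
    state['ema2'] = beta * state['ema2'] + (1 - beta) * state['ema']  # EMA of EMA
    # DEMA keeps the smoothing of an EMA while cancelling much of its lag,
    # which is the "bidirectional looking" behaviour in the title.
    return 2 * state['ema'] - state['ema2']

state = {'ema': 0.0, 'ema2': 0.0}
m = dema_step(0.5, state)  # momentum estimate for this step's gradient
```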
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural
Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling scheme that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
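A toy sketch of the early-exit pattern behind MESS networks follows: lightweight segmentation heads attached at intermediate depths, with a mean-confidence threshold as a stand-in exit policy. The architecture, head design, and policy details are assumptions, not the paper's co-optimised configuration.

```python
import torch
import torch.nn as nn

class EarlyExitSeg(nn.Module):
    def __init__(self, chans=(16, 32, 64), num_classes=21, threshold=0.9):
        super().__init__()
        self.stages, self.exits = nn.ModuleList(), nn.ModuleList()
        prev = 3
        for c in chans:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, padding=1), nn.ReLU()))
            # A 1x1 conv as a lightweight per-exit segmentation head.
            self.exits.append(nn.Conv2d(c, num_classes, 1))
            prev = c
        self.threshold = threshold

    def forward(self, x):
        for stage, head in zip(self.stages, self.exits):
            x = stage(x)
            logits = head(x)
            # Mean per-pixel confidence as a simple batch-level exit policy.
            conf = logits.softmax(1).max(1).values.mean()
            if conf >= self.threshold:  # easy sample: exit early
                return logits
        return logits  # hardest samples traverse the full depth
```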
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- A Full-stack Accelerator Search Technique for Vision Applications [11.932331630567512]
We propose a hardware accelerator search framework that defines a broad optimization environment.
FAST can be used with any number and type of deep learning workload.
Designs generated by FAST optimized for single workloads can improve Perf/TDP by over 6x in the best case.
On a limited workload subset, FAST improves Perf/TDP 2.85x on average, with a reduction to 2.35x for a single design optimized over the set of workloads.
arXiv Detail & Related papers (2021-05-26T21:10:20Z) - SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost
Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all powers of two.
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
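The storage-for-computation trade can be illustrated with the reconstruction arithmetic alone: the weight matrix is rebuilt from a small dense basis matrix and a large sparse coefficient matrix whose non-zeros are powers of two, so each coefficient multiply reduces to a bit shift. The encoding below (separate sign and exponent arrays) is an assumed representation; the decomposition algorithm itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def reconstruct(B, C_exp, C_sign):
    # C_sign in {-1, 0, +1} marks zeros and signs; C_exp holds integer
    # exponents e, so each non-zero coefficient is sign * 2**e and the
    # multiply becomes a bit shift in hardware.
    C = C_sign * np.exp2(C_exp)
    return B @ C  # W ~ B (small, dense) @ C (large, sparse, power-of-2)
```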
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.