Towards Efficient Multi-Scale Deformable Attention on NPU
- URL: http://arxiv.org/abs/2505.14022v1
- Date: Tue, 20 May 2025 07:25:23 GMT
- Title: Towards Efficient Multi-Scale Deformable Attention on NPU
- Authors: Chenghuan Huang, Zhigeng Xu, Chong Sun, Chen Li, Ziyang Ma
- Abstract summary: We present a co-design approach that rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. Our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations.
- Score: 4.661513231235483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-scale deformable attention (MSDA) is a flexible and powerful feature extraction mechanism for visual tasks, but its random-access grid sampling strategy poses significant optimization challenges, especially on domain-specific accelerators such as NPUs. In this work, we present a co-design approach that systematically rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. With this co-design approach, our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations. Extensive experiments show that our solution achieves up to $5.9\times$ (forward), $8.9\times$ (backward), and $7.3\times$ (end-to-end training) speedup over the grid sample-based baseline, and $1.9\times$, $2.4\times$, and $2.0\times$ acceleration over the latest vendor library, respectively.
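The optimization challenge is easiest to see in the grid sample-based baseline the paper benchmarks against. Below is a minimal sketch in the style of the widely used PyTorch reference implementation from Deformable DETR; the tensor names and shapes are illustrative, and this is the baseline formulation, not the authors' Ascend NPU kernel.

```python
import torch
import torch.nn.functional as F

def msda_grid_sample(value, spatial_shapes, sampling_locations, attention_weights):
    # value:              (N, sum(H*W), M, D)  flattened multi-scale feature maps
    # spatial_shapes:     list of (H, W) per feature level
    # sampling_locations: (N, Q, M, L, P, 2)   normalized to [0, 1]
    # attention_weights:  (N, Q, M, L, P)
    N, _, M, D = value.shape
    _, Q, _, L, P, _ = sampling_locations.shape
    value_list = value.split([h * w for h, w in spatial_shapes], dim=1)
    grids = 2 * sampling_locations - 1  # grid_sample expects [-1, 1]
    sampled = []
    for lvl, (h, w) in enumerate(spatial_shapes):
        # (N, h*w, M, D) -> (N*M, D, h, w)
        v = value_list[lvl].flatten(2).transpose(1, 2).reshape(N * M, D, h, w)
        # (N, Q, M, P, 2) -> (N, M, Q, P, 2) -> (N*M, Q, P, 2)
        g = grids[:, :, :, lvl].transpose(1, 2).flatten(0, 1)
        # random-access bilinear gather: the step that is hard to optimize on NPUs
        sampled.append(F.grid_sample(v, g, mode="bilinear",
                                     padding_mode="zeros", align_corners=False))
    # weighted sum over all L*P samples per query and head
    w_ = attention_weights.transpose(1, 2).reshape(N * M, 1, Q, L * P)
    out = (torch.stack(sampled, dim=-2).flatten(-2) * w_).sum(-1)
    return out.view(N, M * D, Q).transpose(1, 2).contiguous()
```

Each query gathers L x P bilinearly interpolated values per head at data-dependent locations; that scattered-read pattern is what the co-design restructures.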
Related papers
- Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable DNN Accelerator Design [4.489116569191255]
Hardware-Aware Neural Architecture Search (HW-NAS) is an efficient approach to automatically co-optimizing neural network performance and hardware energy efficiency. We propose Coflex, a novel HW-NAS framework that integrates the Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization.
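As a rough illustration of the surrogate-driven search this describes, the toy sketch below fits one Gaussian-process surrogate per objective and keeps candidates whose predictions are Pareto-optimal. sklearn's dense GP stands in for the paper's Sparse Gaussian Process, and the function name and the two objectives (latency, energy) are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def suggest_next(X_seen, y_latency, y_energy, X_pool):
    # One surrogate per objective (dense GP here; Coflex's point is that a
    # sparse GP keeps this step scalable for large design spaces).
    gp_lat = GaussianProcessRegressor().fit(X_seen, y_latency)
    gp_eng = GaussianProcessRegressor().fit(X_seen, y_energy)
    mu = np.stack([gp_lat.predict(X_pool), gp_eng.predict(X_pool)], axis=1)
    # Keep candidates whose predicted (latency, energy) pair is not dominated.
    pareto = [i for i, m in enumerate(mu)
              if not any((o <= m).all() and (o < m).any() for o in mu)]
    return X_pool[pareto[0]]  # evaluate one predicted-Pareto design next
```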
arXiv Detail & Related papers (2025-07-31T11:16:46Z)
- MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration [8.43012094714496]
This paper presents a unified framework for codifying and automating optimization strategies to deploy deep neural networks (DNNs) on resource-constrained hardware. Our novel approach addresses two key issues: (i) encoding custom optimization strategies and (ii) enabling cross-stage optimization search.
arXiv Detail & Related papers (2025-02-09T11:02:06Z)
- HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation [0.0]
HEPPO-GAE is an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation stage in Proximal Policy Optimization. Its key innovation is a strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization. The single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency.
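A rough sketch of the standardize-then-quantize idea follows; the running-statistics details, block handling, and function names are assumptions for illustration, not the paper's FPGA pipeline.

```python
import numpy as np

def standardize_and_quantize(rewards, values, eps=1e-8):
    # Dynamic reward standardization (a real pipeline would keep running stats).
    r = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Block standardization for values (here: the whole array as one block).
    v = (values - values.mean()) / (values.std() + eps)

    def quantize_u8(x):
        # 8-bit uniform quantization: map [min, max] onto {0, ..., 255}.
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / 255.0 + eps
        q = np.round((x - lo) / scale).astype(np.uint8)
        return q, scale, lo  # scale/offset needed to dequantize later

    return quantize_u8(r), quantize_u8(v)
```

Storing standardized 8-bit tensors instead of floats is what shrinks the on-chip memory traffic during the GAE stage.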
arXiv Detail & Related papers (2025-01-22T08:18:56Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z) - Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning [11.139231636612479]
We propose Pruner and MoA-Pruner to speed up tensor program tuning. Pruner is a "Draft-then-Verify" exploration mechanism that accelerates the schedule search process. MoA-Pruner introduces a Momentum online Adaptation strategy to address cross-platform online unawareness.
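As an illustration, a "Draft-then-Verify" loop can be sketched as below: a cheap proxy score drafts a short list of promising schedules, and only those are verified with the expensive cost model or hardware measurement. The candidate generator and both scoring functions are hypothetical stand-ins, not Pruner's actual components.

```python
def draft_then_verify(candidates, cheap_score, measure, k=16):
    # Draft: rank all candidate schedules with a cheap proxy score.
    drafted = sorted(candidates, key=cheap_score, reverse=True)[:k]
    # Verify: spend expensive measurements only on the drafted few.
    verified = [(measure(c), c) for c in drafted]
    return max(verified, key=lambda t: t[0])[1]  # best verified schedule
```

The speedup comes from the asymmetry: the proxy prunes most of the search space so the expensive evaluator runs k times instead of len(candidates) times.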
arXiv Detail & Related papers (2024-02-04T06:11:12Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
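A minimal sketch of the shared-backbone, multiple-prediction-head pattern this describes; the layer sizes, head count, and averaging ensemble are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiHeadEnsemble(nn.Module):
    def __init__(self, in_dim=32, hidden=64, out_dim=8, num_heads=4):
        super().__init__()
        # One shared backbone encodes the offloading state once.
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        # Multiple lightweight prediction heads form the ensemble.
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim)
                                   for _ in range(num_heads))

    def forward(self, x):
        z = self.backbone(x)
        # Ensemble the heads' predictions by averaging.
        return torch.stack([h(z) for h in self.heads]).mean(0)
```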
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
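For intuition, a double exponential moving average (DEMA) of gradients can be sketched as below, using the standard DEMA definition (dema = 2*ema - ema_of_ema); how Admeta wires this into RAdam and SGDM is more involved and not reproduced here.

```python
def dema_step(grad, state, beta=0.9):
    # state['ema'] and state['ema2'] start at zero (same shape as grad).
    state['ema'] = beta * state['ema'] + (1 - beta) * grad            # EMA
    state['ema2'] = beta * state['ema2'] + (1 - beta) * state['ema']  # EMA of EMA
    # DEMA keeps the smoothing of an EMA while cancelling much of its lag,
    # which is the "bidirectional looking" behaviour in the title.
    return 2 * state['ema'] - state['ema2']

state = {'ema': 0.0, 'ema2': 0.0}
m = dema_step(0.5, state)  # momentum estimate for this step's gradient
```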
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural
Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling scheme that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
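A toy sketch of the early-exit pattern behind MESS networks follows: lightweight segmentation heads attached at intermediate depths, with a mean-confidence threshold as a stand-in exit policy. The architecture, head design, and policy details are assumptions, not the paper's co-optimised configuration.

```python
import torch
import torch.nn as nn

class EarlyExitSeg(nn.Module):
    def __init__(self, chans=(16, 32, 64), num_classes=21, threshold=0.9):
        super().__init__()
        self.stages, self.exits = nn.ModuleList(), nn.ModuleList()
        prev = 3
        for c in chans:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, padding=1), nn.ReLU()))
            # A 1x1 conv as a lightweight per-exit segmentation head.
            self.exits.append(nn.Conv2d(c, num_classes, 1))
            prev = c
        self.threshold = threshold

    def forward(self, x):
        for stage, head in zip(self.stages, self.exits):
            x = stage(x)
            logits = head(x)
            # Mean per-pixel confidence as a simple batch-level exit policy.
            conf = logits.softmax(1).max(1).values.mean()
            if conf >= self.threshold:  # easy sample: exit early
                return logits
        return logits  # hardest samples traverse the full depth
```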
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- A Full-stack Accelerator Search Technique for Vision Applications [11.932331630567512]
We propose a hardware accelerator search framework that defines a broad optimization environment.
FAST can be used with any number and type of deep learning workload.
Designs generated by FAST optimized for single workloads can improve Perf/TDP by over 6x in the best case.
On a limited workload subset, FAST improves Perf/TDP 2.85x on average, with a reduction to 2.35x for a single design optimized over the set of workloads.
arXiv Detail & Related papers (2021-05-26T21:10:20Z) - SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost
Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all powers of two.
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
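The storage-for-computation trade can be illustrated with the reconstruction arithmetic alone: the weight matrix is rebuilt from a small dense basis matrix and a large sparse coefficient matrix whose non-zeros are powers of two, so each coefficient multiply reduces to a bit shift. The encoding below (separate sign and exponent arrays) is an assumed representation; the decomposition algorithm itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def reconstruct(B, C_exp, C_sign):
    # C_sign in {-1, 0, +1} marks zeros and signs; C_exp holds integer
    # exponents e, so each non-zero coefficient is sign * 2**e and the
    # multiply becomes a bit shift in hardware.
    C = C_sign * np.exp2(C_exp)
    return B @ C  # W ~ B (small, dense) @ C (large, sparse, power-of-2)
```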
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.