Related papers: PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications

URL: http://arxiv.org/abs/2302.01876v1
Date: Fri, 3 Feb 2023 17:26:12 GMT
Title: PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications
Authors: Qiong Li, Chao Fang, Zhongfeng Wang
Abstract summary: Posit has been a promising alternative to the IEEE-754 floating point format for deep learning applications. It has been implemented by either the combination of multipliers and an adder tree or cascaded fused multiply-add units, leading to poor computational efficiency and excessive hardware overhead. We propose an open-source posit dot-product unit, namely PDPU, that facilitates resource-efficient and high-throughput dot-product hardware implementation.
Score: 9.253002604030085
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Posit has been a promising alternative to the IEEE-754 floating point format for deep learning applications due to its better trade-off between dynamic range and accuracy. However, hardware implementation of posit arithmetic requires further exploration, especially for the dot-product operations that dominate deep neural networks (DNNs). The posit dot product has been implemented by either the combination of multipliers and an adder tree or cascaded fused multiply-add units, leading to poor computational efficiency and excessive hardware overhead. To address this issue, we propose an open-source posit dot-product unit, namely PDPU, that facilitates resource-efficient and high-throughput dot-product hardware implementation. PDPU not only features a fused and mixed-precision architecture that eliminates redundant latency and hardware resources, but also has a fine-grained 6-stage pipeline, improving computational efficiency. A configurable PDPU generator is further developed to meet the diverse needs of various DNNs for computational accuracy. Experimental results evaluated under the 28nm CMOS process show that PDPU reduces area, latency, and power by up to 43%, 64%, and 70%, respectively, compared to the existing implementations. Hence, PDPU has great potential as the computing core of posit-based accelerators for deep learning applications.
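As a rough illustration of what a fused posit dot product computes, the following Python sketch decodes posit bit patterns and accumulates the products exactly, deferring any rounding to the end. It is a software model under stated assumptions, not PDPU's hardware design: the function names `decode_posit` and `fused_dot` are hypothetical, the default format posit(8,2) is chosen only for the example, and the final rounding back to a posit is omitted.

```python
from fractions import Fraction


def decode_posit(bits, n=8, es=2):
    """Decode an n-bit posit with es exponent bits into an exact Fraction.

    Returns None for NaR (Not a Real). Hypothetical helper for illustration.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None  # NaR
    negative = bool(bits >> (n - 1))
    if negative:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    # The n-1 bits after the sign, most significant first.
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    # Regime: a run of identical bits, ended by the opposite bit (if any).
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run
    rest = body[run + 1:]  # skip the regime terminator
    # Exponent: up to es bits, padded with zeros if truncated.
    exp_bits = rest[:es] + [0] * (es - len(rest[:es]))
    e = 0
    for b in exp_bits:
        e = e * 2 + b
    # Fraction: whatever bits remain, with an implicit leading 1.
    frac_bits = rest[es:]
    f = 0
    for b in frac_bits:
        f = f * 2 + b
    mantissa = 1 + Fraction(f, 1 << len(frac_bits))
    scale = k * (1 << es) + e
    value = mantissa * Fraction(2) ** scale
    return -value if negative else value


def fused_dot(a_bits, b_bits, n=8, es=2):
    """Exact dot product of two posit vectors; no intermediate rounding."""
    acc = Fraction(0)
    for a, b in zip(a_bits, b_bits):
        x, y = decode_posit(a, n, es), decode_posit(b, n, es)
        if x is None or y is None:
            return None  # NaR propagates
        acc += x * y  # exact accumulation, rounding is deferred
    return acc
```

For example, `fused_dot([0x40, 0x48], [0x60, 0xC0])` multiplies 1 by 16 and 2 by -1 in posit(8,2) and returns the exact sum. A cascade of fused multiply-add units would instead round after every accumulation step.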

Related papers

RAMAN: Resource-efficient ApproxiMate Posit Processing for Algorithm-Hardware Co-desigN [0.0]
This work presents RAMAN, a resource-efficient and approximate posit(8,2)-based Multiply-Accumulate (MAC) architecture. The proposed REAP MAC engine, which is at the core of RAMAN, uses approximation in the posit multiplier to achieve significant area and power reductions.
arXiv Detail & Related papers (2025-10-26T11:09:06Z)
Evaluating the Energy Efficiency of NPU-Accelerated Machine Learning Inference on Embedded Microcontrollers [0.0]
This paper evaluates the impact of Neural Processing Units (NPUs) on machine learning (ML) execution on microcontrollers (MCUs). It shows substantial efficiency gains when inference is offloaded to the NPU. For moderate to large networks, latency improvements ranged from 7x to over 125x, with per-inference net energy reductions up to 143x.
arXiv Detail & Related papers (2025-09-22T08:52:54Z)
Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing [67.98609858326951]
Intra-DP is a high-performance collaborative inference system optimized for deep neural networks (DNNs) on mobile devices. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-08T09:50:57Z)
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs [1.9938412996898076]
Unary-based matrix multiplication hardware aims to leverage data sparsity and low-precision values to enhance hardware efficiency. Integration of such unary hardware into commercial deep learning accelerators (DLAs) remains limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers.
arXiv Detail & Related papers (2024-12-25T23:20:02Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA [20.629635991749808]
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs. At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads. At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
Towards making the most of NLP-based device mapping optimization for OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, that tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels. We propose four different models that provide enhanced contextual information of source codes. Experimental results show that our proposed methodology surpasses that of Cummins et al., providing up to 4% improvement in prediction accuracy.
arXiv Detail & Related papers (2022-08-30T10:20:55Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations. Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the summaries found help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
PLAM: a Posit Logarithm-Approximate Multiplier for Power Efficient Posit-based DNNs [8.623938357911467]
The Posit Number System was introduced in 2017 as a replacement for floating-point numbers. This paper proposes a Posit Logarithm-Approximate multiplication scheme to significantly reduce the complexity of posit multipliers. Experiments show that the proposed technique reduces the area, power, and delay of hardware multipliers by up to 72.86%, 81.79%, and 17.01%, respectively, without accuracy degradation.
arXiv Detail & Related papers (2021-02-18T10:43:07Z)
ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF). ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
Dopant Network Processing Units: Towards Efficient Neural-network Emulators with High-capacity Nanoelectronic Nodes [0.0]
"Dopant Network Processing Units" (DNPUs) are highly energy-efficient and have potentially very high throughput. We introduce DNPUs as high-capacity neurons and move from a single to a multi-neuron framework. We show that feed-forward DNPU networks improve the performance of a single DNPU from 77% to 94% test accuracy.
arXiv Detail & Related papers (2020-07-24T06:35:44Z)
ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning [1.2019888796331233]
Matrix-vector multiplications (MVM) and vector-vector outer product (VVOP) are the two most expensive operations associated with the training of deep neural networks (DNNs). We introduce efficient techniques to SC for weight update in DNNs with the activation functions required by many state-of-the-art networks. Our architecture reduces the computational cost by re-using random numbers and replacing certain FP multiplication operations by bit shift scaling. Hardware design of ESSOP at 14nm technology node shows that, compared to a highly pipelined FP16 multiplier, ESSOP is 82.2% and 93.7% better in energy
arXiv Detail & Related papers (2020-03-25T07:54:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.