A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit
for Analog In-Memory Computing
- URL: http://arxiv.org/abs/2402.07549v1
- Date: Mon, 12 Feb 2024 10:30:45 GMT
- Title: A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit
for Analog In-Memory Computing
- Authors: Elena Ferro, Athanasios Vasilopoulos, Corey Lammie, Manuel Le Gallo,
Luca Benini, Irem Boybat, Abu Sebastian
- Abstract summary: We propose a Near-Memory digital Processing Unit (NMPU) based on fixed-point arithmetic.
It achieves competitive accuracy and higher computing throughput than previous approaches.
We validate the efficacy of the NMPU by using data from an AIMC chip and demonstrate that a simulated AIMC system with the proposed NMPU outperforms existing FP16-based implementations.
- Score: 10.992736723518036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Analog In-Memory Computing (AIMC) is an emerging technology for fast and
energy-efficient Deep Learning (DL) inference. However, a certain amount of
digital post-processing is required to deal with circuit mismatches and
non-idealities associated with the memory devices. Efficient near-memory
digital logic is critical to retain the high area/energy efficiency and low
latency of AIMC. Existing systems adopt Floating Point 16 (FP16) arithmetic
with limited parallelization capability and high latency. To overcome these
limitations, we propose a Near-Memory digital Processing Unit (NMPU) based on
fixed-point arithmetic. It achieves competitive accuracy and higher computing
throughput than previous approaches while minimizing the area overhead.
Moreover, the NMPU supports standard DL activation steps, such as ReLU and
Batch Normalization. We perform a physical implementation of the NMPU design in
a 14 nm CMOS technology and provide detailed performance, power, and area
assessments. We validate the efficacy of the NMPU by using data from an AIMC
chip and demonstrate that a simulated AIMC system with the proposed NMPU
outperforms existing FP16-based implementations, providing 139$\times$
speed-up, 7.8$\times$ smaller area, and competitive power consumption.
Additionally, our approach achieves an inference accuracy of 86.65 %/65.06 %,
with an accuracy drop of just 0.12 %/0.4 % compared to the FP16 baseline when
benchmarked with ResNet9/ResNet32 networks trained on the CIFAR10/CIFAR100
datasets, respectively.
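The post-processing pipeline the abstract describes lends itself to a compact illustration. Below is a minimal Python/NumPy sketch, not the authors' RTL: it assumes a plain signed fixed-point format in which a per-column affine correction and a folded Batch Normalization share one integer multiply-add, with ReLU applied in the integer domain. The names (nmpu_postprocess, FRAC_BITS, scale, offset) and the 12-bit fractional precision are illustrative assumptions.

```python
import numpy as np

# Minimal sketch, NOT the authors' design: fixed-point post-processing of
# raw ADC counts from one AIMC column. The affine correction (scale, offset)
# and a folded batch normalization share one integer multiply-add; ReLU is
# a max with zero. FRAC_BITS is an assumed fractional precision.

FRAC_BITS = 12

def to_fixed(x: float) -> int:
    """Quantize a float constant to signed fixed-point."""
    return int(round(x * (1 << FRAC_BITS)))

def nmpu_postprocess(adc_counts, scale, offset, bn_gamma, bn_beta):
    m = to_fixed(scale * bn_gamma)                 # fused multiplier
    b = to_fixed(offset * bn_gamma + bn_beta)      # fused adder
    acc = adc_counts.astype(np.int64) * m + b      # integer multiply-add
    acc = np.maximum(acc, 0)                       # ReLU in integer domain
    return acc >> FRAC_BITS                        # drop fractional bits

print(nmpu_postprocess(np.array([103, -7, 250]), 0.031, -1.2, 0.9, 0.05))
```

Folding the constants offline means the datapath needs only one multiplier and one adder per output, which is where the area and latency savings over a general FP16 unit plausibly come from.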
Related papers
- PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation [1.2848824355101671]
This paper introduces a novel probabilistic approximate computation (PAC) method that reduces approximation error by 4X compared to existing approaches.
PAC enables efficient sparsity-based computation in compute-in-memory (CiM) systems by simplifying complex MAC vector computations into scalar calculations.
We develop PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS.
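As a rough intuition for how MAC vector computations might collapse into scalar calculations, here is a speculative sketch, not PACiM's actual PAC formulation: the high-order bit-planes of a bit-serial MAC are computed exactly, while the low-order contribution is replaced by a single scalar estimate built from the mean residual value.

```python
import numpy as np

# Speculative sketch only: exact MSB bit-planes, statistical LSB estimate.
def approx_mac(acts: np.ndarray, weights: np.ndarray, exact_bits: int = 4) -> float:
    total = 0.0
    for b in range(7, 7 - exact_bits, -1):        # exact high bit-planes
        plane = (acts >> b) & 1
        total += float(plane @ weights) * (1 << b)
    low = acts & ((1 << (8 - exact_bits)) - 1)    # low-order residual
    # scalar approximation: mean residual value times the weight sum
    total += float(low.mean()) * float(weights.sum())
    return total

acts = np.random.randint(0, 256, size=64, dtype=np.uint8)
weights = np.random.randn(64)
print(approx_mac(acts, weights), float(acts @ weights))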
arXiv Detail & Related papers (2024-08-29T03:58:19Z)
- StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators [5.245727758971415]
Crossbar-based in-memory computing (IMC) has emerged as a promising platform for hardware acceleration of deep neural networks (DNNs).
However, the energy and latency of IMC systems are dominated by the large overhead of the peripheral analog-to-digital converters (ADCs).
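The following sketch illustrates the broad idea the title suggests, under the assumption that "stochastic processing" means replacing a multi-bit ADC read of each partial sum with a few probabilistic 1-bit samples averaged digitally; the sigmoid device response is an assumption, not StoX-Net's circuit.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_readout(partial_sum: np.ndarray, n_samples: int = 8) -> np.ndarray:
    p = 1.0 / (1.0 + np.exp(-partial_sum))        # assumed device response
    samples = rng.random((n_samples,) + partial_sum.shape) < p
    return samples.mean(axis=0)                   # cheap digital average

print(stochastic_readout(np.array([-2.0, 0.0, 1.5])))
```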
arXiv Detail & Related papers (2024-07-17T07:56:43Z)
- Full-Stack Optimization for CAM-Only DNN Inference [2.0837295518447934]
This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors.
We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity.
Our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators.
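The reason ternary weights reduce arithmetic intensity is that every multiplication disappears: with weights in {-1, 0, +1}, a MAC degenerates to masked additions and subtractions, which maps naturally onto the bit-serial operations an associative processor supports. A minimal sketch of the principle (not the proposed compilation flow):

```python
import numpy as np

def ternary_dot(x: np.ndarray, w_ternary: np.ndarray) -> int:
    pos = x[w_ternary == 1].sum()     # add where weight is +1
    neg = x[w_ternary == -1].sum()    # subtract where weight is -1
    return int(pos - neg)             # zeros contribute nothing

x = np.array([3, 1, 4, 1, 5])
w = np.array([1, 0, -1, 1, -1])
print(ternary_dot(x, w))  # 3 - 4 + 1 - 5 = -5
```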
arXiv Detail & Related papers (2024-01-23T10:27:38Z)
- Pruning random resistive memory for optimizing analogue AI [54.21621702814583]
AI models pose unprecedented challenges for energy consumption and environmental sustainability.
One promising solution is to revisit analogue computing, a technique that predates digital computing.
Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning.
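As a toy picture of what edge pruning on a random resistive array could look like, the hypothetical sketch below keeps only the strongest conductances per output column and zeroes the rest, so the surviving topology is shaped by the as-fabricated hardware; the paper's actual co-design is considerably more involved.

```python
import numpy as np

def prune_crossbar(G: np.ndarray, keep_ratio: float = 0.2) -> np.ndarray:
    k = max(1, int(keep_ratio * G.shape[0]))
    pruned = np.zeros_like(G)
    for j in range(G.shape[1]):                  # per output column
        top = np.argsort(np.abs(G[:, j]))[-k:]   # strongest edges survive
        pruned[top, j] = G[top, j]
    return pruned

G = np.random.randn(128, 16) * 1e-6  # random as-fabricated conductances
print(np.count_nonzero(prune_crossbar(G)))
```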
arXiv Detail & Related papers (2023-11-13T08:59:01Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique to effectively estimate the accuracy impact of operators in the on-chip scenario.
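A greedy search of the kind such a framework might run is sketched below; the data structures (acc_impact, hw_cost) and the budgeted loop are illustrative assumptions, not the paper's pipeline.

```python
def choose_bitwidths(layers, acc_impact, hw_cost, acc_budget):
    """Pick, per layer, the cheapest measured bit-width whose estimated
    accuracy loss still fits within the remaining budget (assumed inputs)."""
    plan, spent = {}, 0.0
    for l in layers:
        # candidate widths ordered by measured on-chip cost, cheapest first
        for b in sorted(hw_cost[l], key=hw_cost[l].get):
            if spent + acc_impact[l][b] <= acc_budget:
                plan[l], spent = b, spent + acc_impact[l][b]
                break
        else:
            plan[l] = max(hw_cost[l])  # fall back to widest precision
    return plan

layers = ["conv1", "conv2"]
acc = {"conv1": {2: 0.9, 4: 0.2, 8: 0.0}, "conv2": {2: 0.3, 4: 0.1, 8: 0.0}}
cost = {"conv1": {2: 1, 4: 2, 8: 4}, "conv2": {2: 1, 4: 2, 8: 4}}
print(choose_bitwidths(layers, acc, cost, acc_budget=0.5))
```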
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge.
In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
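The table-lookup trick is easy to state: with 2-bit operands there are only 16 possible products, so multiplication becomes indexing. A minimal sketch with an assumed symmetric codebook (DeepGEMM's SIMD kernels are far more optimized):

```python
import numpy as np

LEVELS = np.array([-1.5, -0.5, 0.5, 1.5])  # assumed 2-bit codebook
LUT = LEVELS[:, None] * LEVELS[None, :]    # all 16 possible products

def lut_dot(a_codes: np.ndarray, w_codes: np.ndarray) -> float:
    # gather products from the table instead of multiplying
    return float(LUT[a_codes, w_codes].sum())

a = np.array([0, 3, 2, 1])
w = np.array([3, 3, 0, 2])
print(lut_dot(a, w), float((LEVELS[a] * LEVELS[w]).sum()))  # identical
```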
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Single-Shot Optical Neural Network [55.41644538483948]
'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by deep neural networks.
We present a scalable, single-shot-per-layer weight-stationary optical processor.
arXiv Detail & Related papers (2022-05-18T17:49:49Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and the compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a scheme can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
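A latency- and accuracy-aware reward of the kind the summary describes might look like the following speculative sketch; the deadline and penalty weight are invented parameters, and the paper's exact formulation may differ.

```python
def reward(accuracy: float, latency_s: float,
           deadline_s: float = 0.05, lam: float = 10.0) -> float:
    # penalize only the latency that exceeds the deadline (assumed design)
    penalty = lam * max(0.0, latency_s - deadline_s)
    return accuracy - penalty

print(reward(0.91, 0.04))  # meets deadline: reward equals accuracy
print(reward(0.91, 0.08))  # misses by 30 ms: 0.91 - 0.3 = 0.61
```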
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain-Machine Interfaces [16.381467082472515]
Motor-Imagery Brain-Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines.
Deep learning models have emerged for classifying EEG signals, but these models often exceed the limitations of edge devices due to their memory and computational requirements.
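For context, the kind of 8-bit quantization such an edge deployment relies on can be sketched in a few lines; this is generic per-tensor post-training quantization, not Q-EEGNet's tuned fixed-point scheme.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0               # per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(8, 8).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # error bounded by ~scale/2
```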
arXiv Detail & Related papers (2020-04-24T12:29:03Z)
- ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning [1.2019888796331233]
Matrix-vector multiplications (MVM) and vector-vector outer products (VVOP) are the two most expensive operations associated with the training of deep neural networks (DNNs).
We introduce efficient techniques that extend stochastic computing (SC) to weight updates in DNNs with the activation functions required by many state-of-the-art networks.
Our architecture reduces the computational cost by re-using random numbers and replacing certain FP multiplication operations by bit shift scaling.
Hardware design of ESSOP at the 14 nm technology node shows that, compared to a highly pipelined FP16 multiplier, ESSOP is 82.2% and 93.7% better in energy and area efficiency, respectively.
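A hedged sketch of the stochastic outer-product idea follows: signs are computed exactly, magnitudes fire probabilistically against random numbers shared along each axis (the re-use the summary mentions), and scaling is a power-of-two shift instead of an FP multiply. The details are assumptions, not ESSOP's hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_outer_update(x: np.ndarray, d: np.ndarray, shift: int = 6):
    # one random vector per axis, re-used across the whole outer product,
    # so each entry fires with probability |x_i| * |d_j| in expectation
    rx, rd = rng.random(x.size), rng.random(d.size)
    fire = (np.abs(x)[:, None] > rx[:, None]) & (np.abs(d)[None, :] > rd[None, :])
    sign = np.sign(x)[:, None] * np.sign(d)[None, :]
    return (fire * sign) / (1 << shift)  # shift-based scaling, no FP multiply

x = rng.uniform(-1, 1, 4)
d = rng.uniform(-1, 1, 3)
print(stochastic_outer_update(x, d))  # expectation approximates outer(x, d) / 64
```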
arXiv Detail & Related papers (2020-03-25T07:54:42Z)