StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators
- URL: http://arxiv.org/abs/2407.12378v1
- Date: Wed, 17 Jul 2024 07:56:43 GMT
- Title: StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators
- Authors: Ethan G Rogers, Sohan Salahuddin Mugdho, Kshemal Kshemendra Gupte, Cheng Wang
- Abstract summary: Crossbar-based in-memory computing (IMC) has emerged as a promising platform for hardware acceleration of deep neural networks (DNNs).
However, the energy and latency of IMC systems are dominated by the large overhead of the peripheral analog-to-digital converters (ADCs).
- Score: 5.245727758971415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Crossbar-based in-memory computing (IMC) has emerged as a promising platform for hardware acceleration of deep neural networks (DNNs). However, the energy and latency of IMC systems are dominated by the large overhead of the peripheral analog-to-digital converters (ADCs). To address this ADC bottleneck, we propose stochastic processing of array-level partial sums (PS) for efficient IMC. Leveraging the probabilistic switching of spin-orbit torque magnetic tunnel junctions, the proposed PS processing eliminates the costly ADC, achieving significant improvements in energy and area efficiency. To mitigate accuracy loss, we develop PS-quantization-aware training that enables backward propagation across stochastic PS. Furthermore, a novel scheme with an inhomogeneous sampling length for the stochastic conversion is proposed. When running ResNet20 on the CIFAR-10 dataset, our architecture-to-algorithm co-design demonstrates up to 22x, 30x, and 142x improvements in energy, latency, and area, respectively, compared to IMC with a standard ADC. Our optimized design configuration using stochastic PS achieves a 666x (111x) improvement in energy-delay product compared to IMC with a full-precision ADC (sparse low-bit ADC), while maintaining near-software accuracy on various benchmark classification tasks.
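As a rough illustration of the abstract's core mechanism, the sketch below models the stochastic partial-sum conversion as repeated Bernoulli sampling of an idealized probabilistic switch, with a straight-through estimator so gradients can propagate across the stochastic step during PS-quantization-aware training. The sigmoid switching model, the sampling length, and all names are illustrative assumptions, not details from the paper.

```python
# Minimal sketch, NOT the paper's implementation: an array-level partial sum is
# digitized by averaging n_samples Bernoulli trials of an idealized SOT-MTJ
# switching model, and training passes gradients through with a
# straight-through estimator (STE).
import torch

class StochasticPS(torch.autograd.Function):
    @staticmethod
    def forward(ctx, ps, n_samples):
        # Assumed model: map the analog partial sum to a switching probability.
        p = torch.sigmoid(ps)
        # n_samples Bernoulli trials -> a coarse digital estimate of p.
        trials = (torch.rand((n_samples,) + p.shape, device=p.device) < p).float()
        return trials.mean(dim=0)

    @staticmethod
    def backward(ctx, grad_out):
        # STE: pass the gradient through the stochastic conversion unchanged.
        return grad_out, None

def stochastic_partial_sum(ps, n_samples=8):
    return StochasticPS.apply(ps, n_samples)

if __name__ == "__main__":
    ps = torch.randn(4, 16, requires_grad=True)  # toy array-level partial sums
    out = stochastic_partial_sum(ps)
    out.sum().backward()
    print(out.shape, ps.grad.shape)
```

Under this reading, the paper's inhomogeneous sampling length would correspond to choosing a different n_samples per layer, trading conversion latency against partial-sum fidelity.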
Related papers
- An Event-Based Digital Compute-In-Memory Accelerator with Flexible Operand Resolution and Layer-Wise Weight/Output Stationarity [0.11522790873450185]
CIM accelerators for spiking neural networks (SNNs) are promising solutions to enable μs-level inference latency and ultra-low energy in edge vision applications.
We propose a novel digital CIM macro that supports arbitrary operand resolution and shape, with a unified CIM storage for weights and membrane potentials.
Our approach can save up to 90% energy in large-scale systems, while reaching a state-of-the-art classification accuracy of 95.8% on the IBM DVS gesture dataset.
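For context on why the macro unifies storage for weights and membrane potentials: a spiking neuron's membrane potential is state that must persist across timesteps, as in this minimal leaky-integrate-and-fire sketch (all constants are assumptions, not values from the paper).

```python
# Illustrative LIF update: the membrane potential v is stateful, which is why
# an SNN CIM macro must keep it in memory alongside the weights.
import numpy as np

def lif_step(v, weights, spikes_in, leak=0.9, v_th=1.0):
    v = leak * v + weights @ spikes_in           # integrate weighted input spikes
    spikes_out = (v >= v_th).astype(np.float32)  # fire when threshold is crossed
    v = np.where(spikes_out > 0, 0.0, v)         # reset neurons that fired
    return v, spikes_out
```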
arXiv Detail & Related papers (2024-10-30T14:55:13Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Full-Stack Optimization for CAM-Only DNN Inference [2.0837295518447934]
This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors.
We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity.
Our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators.
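For reference, ternary weight networks constrain weights to {-alpha, 0, +alpha}. A common symmetric-threshold ternarization rule looks like the sketch below; the 0.7 threshold scale is a standard heuristic, not necessarily the cited paper's scheme.

```python
# Hedged sketch of a typical ternarization rule (not the cited paper's exact
# method): zero out small weights, snap the rest to +/- a learned scale alpha.
import numpy as np

def ternarize(w, delta_scale=0.7):
    delta = delta_scale * np.mean(np.abs(w))  # magnitude threshold
    t = np.where(w > delta, 1.0, np.where(w < -delta, -1.0, 0.0))
    mask = t != 0
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0  # per-tensor scale
    return alpha * t
```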
arXiv Detail & Related papers (2024-01-23T10:27:38Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation [2.7488316163114823]
This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations.
Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix.
On 16×16 crossbars with 8-bit input processing, the proposed approach achieves an energy efficiency of 1602 tera-operations per second per watt.
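The frequency-domain trick this line of work builds on is the convolution theorem: a fixed, parameter-free Fourier transform turns convolution into elementwise multiplication. A minimal numerical check (sizes illustrative):

```python
# Convolution theorem check: circular convolution in the signal domain equals
# elementwise multiplication in the frequency domain.
import numpy as np

n = 16
x = np.random.randn(n)
k = np.random.randn(n)
freq = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))
direct = np.array([sum(x[j] * k[(i - j) % n] for j in range(n)) for i in range(n)])
assert np.allclose(freq, direct)
```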
arXiv Detail & Related papers (2023-09-04T19:19:39Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
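The shared-backbone/multiple-heads pattern the summary describes can be sketched as follows; the layer sizes and the averaging ensemble are assumptions, not taken from the paper.

```python
# Minimal sketch of a shared backbone feeding several prediction heads whose
# outputs are ensembled by averaging.
import torch
import torch.nn as nn

class MultiHeadEnsemble(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_heads=3, out_dim=8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(n_heads))

    def forward(self, x):
        z = self.backbone(x)  # features shared by every head
        return torch.stack([head(z) for head in self.heads], dim=0).mean(dim=0)
```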
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Hardware/Software co-design with ADC-Less In-memory Computing Hardware for Spiking Neural Networks [4.7519630770389405]
Spiking Neural Networks (SNNs) are bio-plausible models that hold great potential for realizing energy-efficient implementations of sequential tasks on resource-constrained edge devices.
We propose a hardware/software co-design methodology to deploy SNNs into an ADC-Less IMC architecture, using sense amplifiers as 1-bit ADCs in place of conventional high-precision ADCs (HP-ADCs).
Our proposed framework incurs minimal accuracy degradation by performing hardware-aware training and is able to scale beyond simple image classification tasks to more complex sequential regression tasks.
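A sense amplifier used as a 1-bit ADC amounts to a single threshold comparison per partial sum; hardware-aware training would expose the same binarization in the network's forward pass. A minimal model (reference voltage assumed):

```python
# Illustrative 1-bit ADC model: each analog partial sum is reduced to one
# comparison against a reference voltage, as a sense amplifier would do.
import numpy as np

def sense_amp_1bit(partial_sums, v_ref=0.0):
    return (partial_sums > v_ref).astype(np.float32)
```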
arXiv Detail & Related papers (2022-11-03T22:37:49Z)
- Single-Shot Optical Neural Network [55.41644538483948]
'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by deep neural networks.
We present a scalable, single-shot-per-layer weight-stationary optical processor.
arXiv Detail & Related papers (2022-05-18T17:49:49Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals [11.31429464715989]
This paper presents a new PIM architecture to efficiently accelerate deep learning tasks.
The architecture minimizes the required A/D conversions through analog accumulation and neural-approximated peripheral circuits.
Evaluations on different benchmarks demonstrate that Neural-PIM can improve energy efficiency by 5.36x (1.73x) and speed up throughput by 3.43x (1.59x) without losing accuracy.
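The analog-accumulation idea can be made concrete: accumulating partial sums in the analog domain and converting once per output replaces one A/D conversion per array with a single conversion. An illustrative comparison with an assumed uniform quantizer:

```python
# Counting conversions, not modeling the cited circuits: quantizing every
# array-level partial sum costs one ADC conversion each, while analog
# accumulation defers to a single conversion per output.
import numpy as np

def uniform_adc(v, bits=4, v_max=8.0):
    step = 2 * v_max / (2 ** bits - 1)  # assumed uniform quantizer
    return np.round(np.clip(v, -v_max, v_max) / step) * step

ps = np.random.randn(8)              # partial sums from 8 crossbar arrays
per_ps = uniform_adc(ps).sum()       # 8 conversions: quantize each PS
accumulated = uniform_adc(ps.sum())  # 1 conversion after analog accumulation
```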
arXiv Detail & Related papers (2022-01-30T16:14:49Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits via soft policy iterations.
With a latency- and accuracy-aware reward design, this computation adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
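Not the cited SAC-d implementation, but a minimal sketch of a discrete stochastic policy of the kind it samples from; the state and action dimensions are assumptions.

```python
# Illustrative discrete policy: one trunk, two categorical heads that sample
# the exit point and the number of compressing bits.
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    def __init__(self, state_dim=16, n_exit_points=4, n_bit_levels=8):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU())
        self.exit_head = nn.Linear(32, n_exit_points)
        self.bits_head = nn.Linear(32, n_bit_levels)

    def forward(self, state):
        z = self.trunk(state)
        exit_pt = torch.distributions.Categorical(logits=self.exit_head(z)).sample()
        bits = torch.distributions.Categorical(logits=self.bits_head(z)).sample()
        return exit_pt, bits
```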
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.