PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation
- URL: http://arxiv.org/abs/2408.16246v1
- Date: Thu, 29 Aug 2024 03:58:19 GMT
- Title: PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation
- Authors: Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka
- Abstract summary: This paper introduces a novel probabilistic approximate computation (PAC) method that reduces approximation error by 4X compared to existing approaches.
PAC enables efficient sparsity-based computation in compute-in-memory (CiM) systems by simplifying complex MAC vector computations into scalar calculations.
We develop PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS.
- Score: 1.2848824355101671
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Approximate computing emerges as a promising approach to enhance the efficiency of compute-in-memory (CiM) systems in deep neural network processing. However, traditional approximate techniques often significantly trade off accuracy for power efficiency, and fail to reduce data transfer between main memory and CiM banks, which dominates power consumption. This paper introduces a novel probabilistic approximate computation (PAC) method that leverages statistical techniques to approximate multiply-and-accumulate (MAC) operations, reducing approximation error by 4X compared to existing approaches. PAC enables efficient sparsity-based computation in CiM systems by simplifying complex MAC vector computations into scalar calculations. Moreover, PAC enables sparsity encoding and eliminates the transmission of LSB activations, significantly reducing data reads and writes. This sets PAC apart from traditional approximate computing techniques, minimizing not only computation power but also memory accesses by 50%, thereby boosting system-level efficiency. We developed PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS while maintaining high accuracy of 93.85/72.36/66.02% on CIFAR-10/CIFAR-100/ImageNet benchmarks using a ResNet-18 model, demonstrating the effectiveness of our PAC methodology.
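To make the idea concrete, here is a minimal sketch of one plausible reading of PAC, assuming a bit-serial CiM dataflow: the MSB bit-planes of the activations are computed exactly, while each LSB bit-plane MAC is replaced by a single scalar product of two statistics (the plane's popcount and the mean weight), and all-zero planes are skipped outright. The function names, the `exact_msbs` split point, and the popcount-times-mean-weight approximation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def exact_bitserial_mac(weights, acts, act_bits=8):
    """Reference bit-serial MAC: one vector MAC per activation bit-plane."""
    acc = 0
    for b in range(act_bits):
        bit_plane = (acts >> b) & 1                        # 0/1 vector for bit b
        acc += int((weights * bit_plane).sum()) << b
    return acc

def pac_style_mac(weights, acts, act_bits=8, exact_msbs=4):
    """Hypothetical PAC-flavored MAC: exact MSB planes, statistical LSB planes."""
    acc = 0
    w_mean = weights.mean()                                # one scalar, reused by all planes
    for b in range(act_bits):
        bit_plane = (acts >> b) & 1
        ones = int(bit_plane.sum())                        # sparsity statistic (popcount)
        if ones == 0:
            continue                                       # all-zero plane: skip the cycle
        if b >= act_bits - exact_msbs:
            acc += int((weights * bit_plane).sum()) << b   # exact vector MAC for MSBs
        else:
            acc += int(round(ones * w_mean)) << b          # scalar approximation for LSBs
    return acc

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=256)
a = rng.integers(0, 256, size=256)
print(exact_bitserial_mac(w, a), pac_style_mac(w, a))      # approximate agreement
```

Intuitively, the scalar term's relative error shrinks as vector length grows (a law-of-large-numbers effect), and any residual error lands on low-significance bits, where it is cheap.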
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
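The decompose-and-recover step the summary mentions can be shown in a few lines: split each unsigned operand into 1-bit planes, multiply the planes pairwise (the part that maps onto 1-bit tensor-core ops), and shift-add the partial products back together. This toy NumPy version is a sketch of the general technique only; the paper's bipolar-INT format and actual GPU kernels are not modeled here.

```python
import numpy as np

def bit_decomposed_matmul(A, B, a_bits=4, b_bits=4):
    """Arbitrary-precision matmul from 1-bit pieces: A @ B is recovered as
    the sum over (i, j) of (plane_i(A) @ plane_j(B)) << (i + j)."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(a_bits):
        A_i = (A >> i) & 1                      # bit-plane i of A (0/1 matrix)
        for j in range(b_bits):
            B_j = (B >> j) & 1                  # bit-plane j of B
            acc += (A_i @ B_j).astype(np.int64) << (i + j)
    return acc

rng = np.random.default_rng(1)
A = rng.integers(0, 16, size=(8, 8))            # 4-bit unsigned operands
B = rng.integers(0, 16, size=(8, 8))
assert np.array_equal(bit_decomposed_matmul(A, B), A @ B)
```

Because the operand precision only changes how many plane pairs are enumerated, the same 1-bit kernel serves any bit width, which is the flexibility the summary refers to.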
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - ISO: Overlap of Computation and Communication within Sequence For LLM Inference [8.616769297336708]
This paper introduces a novel strategy for computation-communication overlap that operates at the sequence level.
Experimental evaluations conducted using 30b/70b models have demonstrated significant improvements in efficiency.
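As a toy illustration of the scheduling idea only, assuming nothing about the paper's GPU implementation: the sketch below double-buffers a per-chunk pipeline with one extra worker thread so that chunk i's communication overlaps chunk i+1's compute. The `compute`/`communicate` stand-ins are hypothetical placeholders for per-chunk model work and inter-device transfers.

```python
import concurrent.futures as cf
import time

def compute(chunk):
    time.sleep(0.01)            # stand-in for per-chunk attention/FFN work
    return chunk * 2

def communicate(result):
    time.sleep(0.01)            # stand-in for all-reduce / KV-cache transfer
    return result

def overlapped(chunks):
    """Issue chunk i's communication asynchronously, then immediately start
    computing chunk i+1; join each transfer one step later (double buffering)."""
    out, pending = [], None
    with cf.ThreadPoolExecutor(max_workers=1) as comm:
        for c in chunks:
            r = compute(c)
            if pending is not None:
                out.append(pending.result())    # previous transfer finishes "for free"
            pending = comm.submit(communicate, r)
        out.append(pending.result())
    return out

chunks = list(range(8))
t0 = time.perf_counter(); [communicate(compute(c)) for c in chunks]
t1 = time.perf_counter(); overlapped(chunks)
t2 = time.perf_counter()
print(f"serial {t1 - t0:.3f}s vs overlapped {t2 - t1:.3f}s")  # ~2x with equal stages
```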
arXiv Detail & Related papers (2024-09-04T05:22:17Z) - A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing [10.992736723518036]
We propose a Near-Memory digital Processing Unit (NMPU) based on fixed-point arithmetic.
It achieves accuracy competitive with previous approaches while delivering higher computing throughput.
We validate the efficacy of the NMPU by using data from an AIMC chip and demonstrate that a simulated AIMC system with the proposed NMPU outperforms existing FP16-based implementations.
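For flavor, here is a minimal sketch of the kind of fixed-point affine rescale such a near-memory unit might apply to raw integer accumulators coming out of an analog crossbar (e.g. requantization or folded batch-norm). The Q-format width `FRAC`, the rounding scheme, and the affine role are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

FRAC = 12  # hypothetical number of fractional bits in the Q-format

def fixed_affine(acc, scale, bias, frac=FRAC):
    """Compute y = acc*scale + bias entirely in integer arithmetic:
    quantize scale/bias to the Q-format, multiply-add, round, shift back."""
    scale_q = int(round(scale * (1 << frac)))   # scale in fixed point
    bias_q = int(round(bias * (1 << frac)))     # bias in fixed point
    return (acc * scale_q + bias_q + (1 << (frac - 1))) >> frac  # round-half-up

accs = np.arange(-4, 5) * 100                   # mock crossbar accumulators
ref = np.round(accs * 0.037 + 1.5)              # floating-point reference
out = np.array([fixed_affine(int(a), 0.037, 1.5) for a in accs])
print(np.max(np.abs(out - ref)))                # quantization error: ~1 LSB at most here
```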
arXiv Detail & Related papers (2024-02-12T10:30:45Z) - CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures [0.1747623282473278]
We present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures.
We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms.
arXiv Detail & Related papers (2024-01-15T13:35:21Z) - Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z) - Pruning random resistive memory for optimizing analogue AI [54.21621702814583]
AI models pose unprecedented challenges for energy consumption and environmental sustainability.
One promising solution is to revisit analogue computing, a technique that predates digital computing.
Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning.
arXiv Detail & Related papers (2023-11-13T08:59:01Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
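The underlying arithmetic fact is simple: accumulating n products of a b_w-bit signed weight and a b_a-bit unsigned activation needs roughly log2(n) extra accumulator bits in the worst case. The paper works in the opposite direction, constraining weights during training so that a given accumulator width is provably safe; the helper below just computes the conservative width bound, with the signedness conventions chosen as assumptions.

```python
import math

def min_accumulator_bits(n, weight_bits, act_bits):
    """Smallest two's-complement accumulator width that can never overflow
    when summing n products of signed weights and unsigned activations."""
    w_max = 1 << (weight_bits - 1)          # |w| <= 2^(b_w - 1), e.g. 128 for int8
    a_max = (1 << act_bits) - 1             # a <= 2^b_a - 1, e.g. 255 for uint8
    worst = n * w_max * a_max               # worst-case accumulator magnitude
    return math.ceil(math.log2(worst + 1)) + 1   # +1 for the sign bit

# A 512-long int8 x uint8 dot product already needs a 25-bit accumulator:
print(min_accumulator_bits(512, 8, 8))      # -> 25
```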
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
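The random-feature trick itself is easy to demonstrate, assuming the single-hidden-layer case: the NNGP kernel of a one-layer ReLU network is an arc-cosine kernel with a closed form, and inner products of Monte-Carlo ReLU features converge to it. RFAD's real construction handles deeper architectures and plugs the features into a distillation objective, neither of which this standalone sketch attempts.

```python
import numpy as np

def relu_random_features(X, D=4096, seed=0):
    """Features whose inner products estimate K(x, y) = E_w[relu(w.x) relu(w.y)],
    w ~ N(0, I) -- the NNGP kernel of a one-hidden-layer ReLU network."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], D))
    return np.maximum(X @ W, 0.0) / np.sqrt(D)

def nngp_relu_kernel(x, y):
    """Closed form of the same kernel (a degree-1 arc-cosine kernel)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    t = np.arccos(cos_t)
    return nx * ny * (np.sin(t) + (np.pi - t) * cos_t) / (2 * np.pi)

x = np.array([1.0, 2.0, -0.5])
y = np.array([0.3, -1.0, 2.0])
Phi = relu_random_features(np.stack([x, y]))
print(Phi[0] @ Phi[1], nngp_relu_kernel(x, y))   # Monte-Carlo estimate vs exact, close
```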
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - Learning Centric Power Allocation for Edge Intelligence [84.16832516799289]
Edge intelligence has been proposed to collect distributed data and perform machine learning at the edge.
This paper proposes a learning centric power allocation (LCPA) method, which allocates radio resources based on an empirical classification error model.
Experimental results show that the proposed LCPA algorithm significantly outperforms other power allocation algorithms.
arXiv Detail & Related papers (2020-07-21T07:02:07Z) - Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain–Machine Interfaces [16.381467082472515]
Motor-Imagery Brain–Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines.
Deep learning models have emerged for classifying EEG signals.
However, their memory and computational requirements often exceed the limits of edge devices.
arXiv Detail & Related papers (2020-04-24T12:29:03Z)