MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with
Co-designed Compressed Neural Networks
- URL: http://arxiv.org/abs/2010.12861v2
- Date: Tue, 25 May 2021 05:38:22 GMT
- Title: MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with
Co-designed Compressed Neural Networks
- Authors: Syuan-Hao Sie, Jye-Luen Lee, Yi-Ren Chen, Chih-Cheng Lu, Chih-Cheng
Hsieh, Meng-Fan Chang, Kea-Tiong Tang
- Abstract summary: Convolutional neural networks (CNNs) play a key role in deep learning applications.
Computing-in-memory (CIM) architectures have demonstrated great potential for efficiently computing large-scale matrix-vector multiplications.
To reduce computation costs, network pruning and quantization are two widely studied compression methods for shrinking model size.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) play a key role in deep learning
applications. However, the large storage overheads and the substantial
computation cost of CNNs are problematic in hardware accelerators.
Computing-in-memory (CIM) architecture has demonstrated great potential to
effectively compute large-scale matrix-vector multiplication. However, the
intensive multiply and accumulation (MAC) operations executed at the crossbar
array and the limited capacity of CIM macros remain bottlenecks for further
improvement of energy efficiency and throughput. To reduce computation costs,
network pruning and quantization are two widely studied compression methods to
shrink the model size. However, most model compression algorithms can only be implemented in digital CNN accelerators. For implementation in a static random access memory (SRAM) CIM-based accelerator, the model compression algorithm must consider the hardware limitations of CIM macros, such as the number of word lines and bit lines that can be turned on simultaneously, as well as how the weights are mapped to the SRAM CIM macro. In this study, a software and hardware co-design approach is proposed to design an SRAM CIM-based CNN accelerator and an SRAM CIM-aware model compression algorithm. To lessen the high-precision MAC operations required by batch normalization (BN), a quantization algorithm that fuses BN into the weights is proposed. Furthermore, to reduce the number of network parameters, a sparsity algorithm that accounts for the CIM architecture is proposed. Finally, MARS, a CIM-based CNN accelerator that can utilize multiple SRAM CIM macros as processing units and supports sparse neural networks, is proposed.
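Because inference-time batch normalization is just an affine, per-output-channel transform, it can be folded into the preceding convolution's weights and bias, which is the essence of the BN-fusion step described above. A minimal numpy sketch of the standard folding identity (function and variable names are ours, not the paper's; the paper additionally quantizes the folded weights):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, applied per
    # output channel, so it folds into the conv's weights and bias.
    scale = gamma / np.sqrt(var + eps)          # shape: (out_channels,)
    w_folded = w * scale[:, None, None, None]   # rescale each filter
    b_folded = beta + (b - mean) * scale        # absorb shift into bias
    return w_folded, b_folded                   # conv'(x) == BN(conv(x))
```

After folding, the accelerator evaluates a single convolution, and no high-precision MAC is spent on BN at inference time.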
Related papers
- BasisN: Reprogramming-Free RRAM-Based In-Memory-Computing by Basis Combination for Deep Neural Networks [9.170451418330696]
We propose the BasisN framework to accelerate deep neural networks (DNNs) on any number of crossbars without reprogramming.
Cycles per inference and energy-delay product are reduced to below 1% of those required when crossbars are reprogrammed.
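The mechanism suggested by this summary: basis matrices are programmed into the crossbars once, and each layer's weights are realized as a linear combination of them, so deploying a new layer changes only digital coefficients, not crossbar contents. A hedged numpy sketch of that general idea (sizes, names, and the plain linear combination are our assumptions; BasisN's exact formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

k, rows, cols = 8, 64, 64
bases = rng.standard_normal((k, rows, cols))   # programmed once on crossbars
coeffs = rng.standard_normal(k)                # per-layer, held digitally
x = rng.standard_normal(rows)

# W = sum_i coeffs[i] * bases[i] is never written to hardware; instead
# x @ W = sum_i coeffs[i] * (x @ bases[i]): k in-memory MVMs, then a
# cheap digital weighted sum -- no crossbar reprogramming per layer.
partials = np.stack([x @ B for B in bases])
y = coeffs @ partials
assert np.allclose(y, x @ np.tensordot(coeffs, bases, axes=1))
```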
arXiv Detail & Related papers (2024-07-04T08:47:05Z)
- Dynamic Semantic Compression for CNN Inference in Multi-access Edge Computing: A Graph Reinforcement Learning-based Autoencoder [82.8833476520429]
We propose a novel semantic compression method, an autoencoder-based CNN architecture (AECNN), for effective semantic extraction and compression in partial offloading.
In the semantic encoder, we introduce a feature compression module based on the channel attention mechanism in CNNs, to compress intermediate data by selecting the most informative features.
In the semantic decoder, we design a lightweight decoder to reconstruct the intermediate data through learning from the received compressed data to improve accuracy.
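A rough sketch of the attention-guided compression described in this entry: score channels, send only the most informative ones, and zero-fill on the receiver before a decoder refines the result. The plain channel-mean score and zero-fill reconstruction below are simplifying assumptions standing in for AECNN's learned attention module and learned lightweight decoder:

```python
import numpy as np

def compress_features(feat, keep_ratio=0.25):
    # feat: (channels, h, w) intermediate CNN activations.
    scores = feat.mean(axis=(1, 2))        # stand-in for learned attention
    k = max(1, int(feat.shape[0] * keep_ratio))
    keep = np.argsort(scores)[-k:]         # indices of top-k channels
    return feat[keep], keep                # payload sent over the channel

def reconstruct_features(payload, keep, shape):
    # Receiver side: zero-fill dropped channels; the real method learns
    # a lightweight decoder to improve on this.
    out = np.zeros(shape, dtype=payload.dtype)
    out[keep] = payload
    return out
```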
arXiv Detail & Related papers (2024-01-19T15:19:47Z)
- CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures [0.1747623282473278]
We present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures.
We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms.
arXiv Detail & Related papers (2024-01-15T13:35:21Z)
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
- DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory [6.367916611208411]
We propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity.
DDC-PIM yields about $2.84\times$ speedup on MobileNetV2 and $2.69\times$ on EfficientNet-B0 with negligible accuracy loss.
Compared with state-of-the-art macros, DDC-PIM achieves up to $8.41\times$ and $2.75\times$ improvement in weight density and area efficiency, respectively.
arXiv Detail & Related papers (2023-10-31T12:49:54Z)
- Stochastic Configuration Machines: FPGA Implementation [4.57421617811378]
Stochastic configuration networks (SCNs) are a prime choice in industrial applications due to their merits and feasibility for data modelling.
This paper aims to implement SCM models on a field programmable gate array (FPGA) and introduce binary-coded inputs to improve learning performance.
arXiv Detail & Related papers (2023-10-30T02:04:20Z)
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and shrinks unimportant weights on-the-fly by a small amount proportional to their magnitude.
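The key idea, as reconstructed above, is soft rather than hard pruning: weights below the sparsity threshold are scaled down slightly each iteration instead of being zeroed, so wrongly pruned weights can recover. A simplified numpy sketch (the actual method interleaves this shrinkage with training updates on the super-resolution network):

```python
import numpy as np

def iterative_soft_shrink(w, sparsity_pct=50.0, shrink=0.02, steps=100):
    w = w.copy()
    for _ in range(steps):
        # Threshold at the target magnitude percentile each iteration.
        thresh = np.percentile(np.abs(w), sparsity_pct)
        small = np.abs(w) < thresh
        # Soft shrink: scale down instead of hard-zeroing, proportional
        # to the weight's own magnitude, so recovery remains possible.
        w[small] *= (1.0 - shrink)
    return w
```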
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
- PSCNN: A 885.86 TOPS/W Programmable SRAM-based Computing-In-Memory Processor for Keyword Spotting [0.10547353841674209]
This paper proposes a programmable CIM processor with a single large-sized CIM macro instead of multiple smaller ones for power-efficient computation.
The proposed architecture adopts a pooling write-back method to support fused or independent convolution/pooling operations, reducing latency by 35.9%.
The design fabricated in TSMC 28nm technology achieves 150.8 GOPS throughput and 885.86 TOPS/W power efficiency at 10 MHz when executing our binary keyword spotting model.
arXiv Detail & Related papers (2022-05-02T09:58:18Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete settings (SAC-d), which generates the exit point and compressing bits via soft policy iterations.
With a latency- and accuracy-aware reward design, the resulting policy adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
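The decomposition has a standard bit-plane reading: an M-bit quantized value is a weighted sum of {0, 1} bits, and mapping each bit b to e = 2b - 1 turns every bit-plane into a {-1, +1} binary branch plus a constant offset, so a QNN becomes M XNOR-friendly binary networks. A small numpy sketch of this mapping (the paper's exact encoding may differ in detail):

```python
import numpy as np

def decompose_pm1(q, bits):
    # q = 0.5 * sum_i 2**i * e_i + (2**bits - 1) / 2, with e_i in {-1, +1}.
    return [2 * ((q >> i) & 1) - 1 for i in range(bits)]

q = np.random.randint(0, 16, size=(4, 4))        # 4-bit quantized weights
branches = decompose_pm1(q, bits=4)              # four {-1, +1} branches
recon = 0.5 * sum((2 ** i) * e for i, e in enumerate(branches)) \
        + (2 ** 4 - 1) / 2
assert np.allclose(recon, q)                     # exact reconstruction
```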
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly deliver true inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
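To make "micro-structured unification" concrete, the toy sketch below forces every fixed-size block of a weight row to share a single magnitude (signs preserved), so the hardware stores one value plus a sign mask per block. The block size and the mean-magnitude rule are our assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def unify_microblocks(w, block=4):
    # w: 2-D weight matrix; each length-`block` segment of a row is
    # replaced by one shared magnitude, keeping the original signs.
    out = w.copy()
    for r in range(out.shape[0]):
        for c in range(0, out.shape[1], block):
            seg = out[r, c:c + block]
            out[r, c:c + block] = np.sign(seg) * np.abs(seg).mean()
    return out
```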
arXiv Detail & Related papers (2021-06-15T17:22:59Z)