Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators
- URL: http://arxiv.org/abs/2508.21524v1
- Date: Fri, 29 Aug 2025 11:24:24 GMT
- Title: Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators
- Authors: Wenyong Zhou, Zhengwu Liu, Yuan Ren, Ngai Wong
- Abstract summary: We introduce a novel binary weight multi-bit activation (BWMA) method for CNNs on CIM-based accelerators. Our contributions include deriving closed-form solutions for weight quantization in each layer, significantly improving the representational capabilities of binarized weights. We show that BWMA achieves notable accuracy improvements over existing methods, registering gains of 1.44%-5.46% and 0.35%-5.37% on the respective datasets.
- Score: 19.034502382765755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compute-in-memory (CIM) accelerators have emerged as a promising way to enhance the energy efficiency of convolutional neural networks (CNNs). Deploying CNNs on CIM platforms generally requires quantizing network weights and activations to meet hardware constraints. However, existing approaches either prioritize hardware efficiency with binary weight and activation quantization at the cost of accuracy, or use multi-bit weights and activations for greater accuracy but limited efficiency. In this paper, we introduce a novel binary weight multi-bit activation (BWMA) method for CNNs on CIM-based accelerators. Our contributions include: deriving closed-form solutions for weight quantization in each layer, significantly improving the representational capabilities of binarized weights; and developing a differentiable function for activation quantization that approximates the ideal multi-bit function while bypassing the extensive search for optimal settings. Through comprehensive experiments on the CIFAR-10 and ImageNet datasets, we show that BWMA achieves notable accuracy improvements over existing methods, registering gains of 1.44%-5.46% and 0.35%-5.37% on the respective datasets. Moreover, hardware simulation results indicate that 4-bit activation quantization strikes the optimal balance between hardware cost and model performance.
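The abstract does not spell out the formulas, so the snippet below is only a rough sketch of the two ingredients it names: a closed-form, XNOR-Net-style weight binarization (scale = mean absolute weight) and a differentiable multi-bit activation surrogate built from shifted sigmoids. The function names, the sigmoid-sum form, and the hyperparameters are illustrative assumptions, not the paper's actual BWMA formulation.

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Closed-form, XNOR-Net-style binarization: scale alpha = mean(|w|),
    binary codes = sign(w). A straight-through estimator keeps gradients
    flowing to the latent full-precision weights."""
    alpha = w.abs().mean()                      # closed-form scaling factor
    w_bin = alpha * torch.sign(w)               # values in {-alpha, +alpha}
    return w + (w_bin - w).detach()             # forward: w_bin, backward: identity

def soft_multibit_activation(x: torch.Tensor, bits: int = 4,
                             x_max: float = 6.0, temperature: float = 10.0) -> torch.Tensor:
    """Differentiable surrogate for a uniform multi-bit activation quantizer:
    a sum of shifted sigmoids approaches the ideal staircase as `temperature`
    grows, so no exhaustive search over quantizer settings is needed."""
    levels = 2 ** bits - 1
    step = x_max / levels
    x = torch.clamp(x, 0.0, x_max)
    stairs = [torch.sigmoid(temperature * (x - (i + 0.5) * step)) for i in range(levels)]
    return step * torch.stack(stairs, dim=0).sum(dim=0)

if __name__ == "__main__":
    w = torch.randn(64, 3, 3, 3, requires_grad=True)
    x = torch.rand(8, 3, 32, 32) * 6.0
    print(binarize_weights(w).detach().unique())   # two values: -alpha, +alpha
    y = soft_multibit_activation(x, bits=4)
    print(y.min().item(), y.max().item())          # approximately spans [0, x_max]
```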
Related papers
- Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation [18.963480523099694]
Spiking Neural Networks (SNNs) support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulates (ACCs). We propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods.
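As a hypothetical illustration of the idea (not the SpikeQuant algorithm itself), the sketch below flags the largest-magnitude activations as salient, re-encodes them as integer spike counts so a MAC becomes repeated accumulation, and uses plain uniform low-bit quantization for everything else. The saliency fraction, time steps, and bit-width are assumed for the example.

```python
import torch

def spike_count_quantize(x: torch.Tensor, salient_frac: float = 0.01,
                         low_bits: int = 4, timesteps: int = 16) -> torch.Tensor:
    """Illustrative sketch: salient activations -> spike counts, others -> uniform low-bit."""
    x_absmax = x.abs().max()
    thresh = torch.quantile(x.abs().flatten(), 1.0 - salient_frac)
    salient = x.abs() >= thresh

    # Non-salient path: symmetric uniform quantization to `low_bits`.
    scale = x_absmax / (2 ** (low_bits - 1) - 1)
    x_low = torch.round(x / scale) * scale

    # Salient path: magnitude encoded as a spike count in [0, timesteps],
    # so the multiply is replaced by `count` additions of the weight.
    counts = torch.clamp(torch.round(x.abs() / x_absmax * timesteps), 0, timesteps)
    x_spike = torch.sign(x) * counts * (x_absmax / timesteps)

    return torch.where(salient, x_spike, x_low)
```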
arXiv Detail & Related papers (2025-10-22T11:50:00Z)
- Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators [0.20971479389679332]
The energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors. We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings. CNNs utilizing quantized weights and activations and suitable mappings can significantly improve trade-offs among the accuracy, energy, and memory requirements.
arXiv Detail & Related papers (2024-04-08T10:10:30Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weights and 1-bit activations) of compactly designed backbone architectures results in severe performance degradation.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate this degradation.
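For context, the sketch below shows the generic 1-bit QAT baseline that such binarization-aware training methods build on: sign binarization of weights and activations with a straight-through estimator (STE). BiTAT's task-dependent aggregated transformation is not reproduced here; the layer is a standard illustrative construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConv2d(nn.Conv2d):
    """Generic 1-bit QAT convolution using the straight-through estimator."""

    @staticmethod
    def _ste_sign(t: torch.Tensor) -> torch.Tensor:
        # Forward: sign(t); backward: identity gradient (straight-through).
        return t + (torch.sign(t) - t).detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_bin = self.weight.abs().mean() * self._ste_sign(self.weight)  # scaled binary weights
        x_bin = self._ste_sign(torch.clamp(x, -1.0, 1.0))               # 1-bit activations
        return F.conv2d(x_bin, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Usage: swap nn.Conv2d for BinaryConv2d and fine-tune with the usual training loop.
layer = BinaryConv2d(3, 16, kernel_size=3, padding=1)
out = layer(torch.randn(1, 3, 32, 32))
```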
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z)
- SWIS -- Shared Weight bIt Sparsity for Efficient Neural Network Acceleration [68.36996813591423]
Quantization is spearheading the increase in performance and efficiency of neural network computing systems.
We present SWIS - Shared Weight bIt Sparsity, a quantization framework for efficient neural network inference acceleration.
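The summary does not describe the mechanism, so the sketch below only illustrates the generic bit-serial decomposition that weight-bit-sparsity schemes build on: quantized weights are split into bit-planes, and all-zero planes (or groups within a plane) can be skipped. SWIS's specific shared-weight grouping is not reproduced; this is an assumption-laden illustration.

```python
import torch

def weight_bitplanes(w_int: torch.Tensor, bits: int = 8):
    """Decompose signed integer weights into bit-planes for bit-serial MACs."""
    sign = torch.sign(w_int)
    mag = w_int.abs().to(torch.int64)
    planes = [torch.bitwise_and(torch.bitwise_right_shift(mag, b), 1) for b in range(bits)]
    return planes, sign

# The integer dot product is reassembled as sum_b 2^b * (sign-corrected plane_b . x),
# and any all-zero plane contributes nothing and can be skipped.
```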
arXiv Detail & Related papers (2021-03-01T21:03:20Z)
- Activation Density based Mixed-Precision Quantization for Energy Efficient Neural Networks [2.666640112616559]
We propose an in-training quantization method for neural network models.
Our method calculates the bit-width for each layer while training a mixed-precision model with competitive accuracy.
We run experiments on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet with VGG19/ResNet18 architectures.
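As a rough sketch of the idea, activation density can be measured as the fraction of non-zero post-ReLU outputs of a layer and then mapped to a per-layer bit-width; the linear mapping below is an illustrative assumption, not the paper's exact rule.

```python
import torch

def activation_density(act: torch.Tensor) -> float:
    """Fraction of non-zero (post-ReLU) activations produced by a layer."""
    return (act > 0).float().mean().item()

def density_to_bits(density: float, min_bits: int = 2, max_bits: int = 8) -> int:
    """Illustrative mapping only: layers with denser activations get more bits."""
    return int(round(min_bits + density * (max_bits - min_bits)))

# Example: a layer whose outputs are 35% non-zero would get about 4 bits here.
print(density_to_bits(0.35))
```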
arXiv Detail & Related papers (2021-01-12T09:01:44Z)
- PAMS: Quantized Super-Resolution via Parameterized Max Scale [84.55675222525608]
Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR).
We propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies a trainable truncation parameter to adaptively explore the upper bound of the quantization range.
Experiments demonstrate that the proposed PAMS scheme can effectively compress and accelerate existing SR models such as EDSR and RDN.
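The sketch below shows a quantizer with a trainable clipping upper bound in the spirit of PAMS (and the earlier PACT): the max scale is a learned parameter and rounding uses a straight-through estimator. The class name, initialization, and unsigned-range assumption are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LearnableMaxScaleQuant(nn.Module):
    """Activation quantizer with a trainable clipping bound `max_scale`."""

    def __init__(self, bits: int = 4, init_max: float = 6.0):
        super().__init__()
        self.bits = bits
        self.max_scale = nn.Parameter(torch.tensor(init_max))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.max_scale
        x_clip = torch.minimum(torch.relu(x), a)      # clip to [0, a]; gradients reach a
        n = 2 ** self.bits - 1
        x_q = torch.round(x_clip / a * n) / n * a     # uniform levels inside [0, a]
        return x_clip + (x_q - x_clip).detach()       # STE through the rounding

quant = LearnableMaxScaleQuant(bits=4)
y = quant(torch.randn(8, 64) * 3.0)                   # max_scale is learned jointly with the model
```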
arXiv Detail & Related papers (2020-11-09T06:16:05Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables and utilize a differentiable method to search for them accurately.
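One common way to make such a discrete search differentiable is to keep a logit vector per weight over candidate levels and use the softmax-weighted mixture during training, hardening to the argmax level afterwards; the sketch below follows that pattern. The candidate levels and temperature handling are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class SearchableQuantWeight(nn.Module):
    """Discrete weights as searchable variables via a softmax over candidate levels."""

    def __init__(self, shape, levels=(-1.0, -0.5, 0.0, 0.5, 1.0)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels))
        self.logits = nn.Parameter(torch.zeros(*shape, len(levels)))

    def forward(self, temperature: float = 1.0) -> torch.Tensor:
        probs = torch.softmax(self.logits / temperature, dim=-1)
        return (probs * self.levels).sum(dim=-1)          # soft weight during the search

    def discretize(self) -> torch.Tensor:
        return self.levels[self.logits.argmax(dim=-1)]    # hard weight after the search

w = SearchableQuantWeight((16, 8))
print(w().shape, w.discretize().shape)                    # torch.Size([16, 8]) twice
```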
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Efficient Bitwidth Search for Practical Mixed Precision Neural Network [33.80117489791902]
Network quantization has rapidly become one of the most widely used methods to compress and accelerate deep neural networks.
Recent efforts propose to quantize weights and activations from different layers with different precision to improve the overall performance.
It is challenging to find the optimal bitwidth (i.e., precision) for weights and activations of each layer efficiently.
It is yet unclear how to perform convolution for weights and activations of different precision efficiently on generic hardware platforms.
arXiv Detail & Related papers (2020-03-17T08:27:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.