Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators
- URL: http://arxiv.org/abs/2601.21737v2
- Date: Fri, 30 Jan 2026 13:40:21 GMT
- Title: Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators
- Authors: Rebecca Pelke, Joel Klein, Jose Cubero-Cascante, Nils Bosbach, Jan Moritz Joseph, Rainer Leupers
- Abstract summary: We propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.
- Score: 0.8708298560474775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the crossbar inputs and cells are very limited, most CIM compilers do not support quantization below 8 bits. As a result, a single MVM requires many compute cycles, and weights cannot be efficiently stored in a single crossbar cell. To address this problem, we propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. This is why we introduce a reinforcement learning-based strategy to find suitable quantization configurations that balance latency and accuracy. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.
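To make the search concrete, here is a minimal, self-contained sketch of the kind of latency/accuracy trade-off such a framework explores over per-layer bit widths. The cost models, the per-layer sensitivities, and the epsilon-greedy loop are illustrative stand-ins, not the authors' reinforcement-learning implementation:

```python
# Illustrative sketch (not the paper's method): epsilon-greedy search over
# per-layer bit-width configurations, rewarding low latency while penalizing
# accuracy loss. All cost models below are toy stand-ins.
import random

LAYERS = 6
CHOICES = [2, 4, 8]                           # candidate bit widths per layer
SENS = [0.30, 0.02, 0.05, 0.01, 0.20, 0.03]   # made-up per-layer sensitivities

def latency(cfg):
    # Toy cost model: crossbar MVM cycles grow with the bit width of each layer.
    return sum(b * b for b in cfg)

def accuracy_loss(cfg):
    # Toy model: sensitive layers lose more accuracy when quantized below 8 bits.
    return sum(s * (8 - b) for s, b in zip(SENS, cfg))

def reward(cfg, lam=100.0):
    # Balance latency against accuracy loss, the trade-off the agent optimizes.
    return -latency(cfg) - lam * accuracy_loss(cfg)

def search(steps=500, eps=0.3, seed=0):
    rng = random.Random(seed)
    best = [8] * LAYERS                       # start from a safe all-8-bit config
    best_r = reward(best)
    for _ in range(steps):
        cand = best[:]
        i = rng.randrange(LAYERS)
        if rng.random() < eps:
            cand[i] = rng.choice(CHOICES)     # explore a random bit width
        else:
            cand[i] = max(2, cand[i] // 2)    # exploit: try halving the bit width
        r = reward(cand)
        if r > best_r:                        # keep strictly better configs
            best, best_r = cand, r
    return best

print("per-layer bit widths:", search())
```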
Related papers
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
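As a numerics-only illustration of the weight-only 4-bit setting such kernels accelerate, the sketch below quantizes weights per group and dequantizes on the fly before the matmul; it does not model MARLIN's fused GPU kernel, asynchronous memory access, or scheduling:

```python
# Numerics-only sketch of weight-only int4 quantization; the real MARLIN
# kernel fuses dequantization into a GPU GEMM, which is not modeled here.
import numpy as np

def quantize_int4(w, group=32):
    """Symmetric per-group int4 quantization: returns codes and scales."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale, shape):
    """Dequantize on the fly, then multiply: y = x @ W_hat."""
    w_hat = (q.astype(np.float32) * scale).reshape(shape)
    return x @ w_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((16, 256)).astype(np.float32)   # a batch of 16 tokens
q, s = quantize_int4(W)
err = np.abs(x @ W - dequant_matmul(x, q, s, W.shape)).mean()
print("mean |error| under int4 weight quantization:", err)
```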
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs).
We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise.
Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
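A hedged sketch of the group-wise allocation idea: rank weight groups by a salience score and give the most salient groups more bits under an average-bit budget. Mean absolute weight is used as a stand-in salience score here; SliM-LLM derives salience differently:

```python
# Sketch of salience-driven, group-wise bit allocation. The salience score
# (mean |w|) is an illustrative stand-in, not SliM-LLM's actual metric.
import numpy as np

def allocate_bits(w, group=64, budget=3.0, low=2, high=4):
    groups = w.reshape(-1, group)
    salience = np.abs(groups).mean(axis=1)     # stand-in salience score
    n = salience.size
    # How many groups can afford the high bit width under the average budget?
    k = int((budget - low) / (high - low) * n)
    bits = np.full(n, low)
    bits[np.argsort(salience)[-k:]] = high     # most salient groups get more bits
    return bits

rng = np.random.default_rng(1)
w = rng.standard_normal(64 * 256)
bits = allocate_bits(w)
print("average bits per weight:", bits.mean())  # hits the 3-bit budget
```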
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
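The core numerical step of 1-bit quantization fits in a few lines: binarize with the closed-form optimal scale alpha = mean(|w|), optionally binarizing the residual a second time. This is only the textbook building block; BiLLM's actual scheme (salient-weight selection and structured splitting) is more elaborate:

```python
# Textbook 1-bit binarization with the optimal scale; alpha = mean(|w|)
# minimizes ||w - alpha * sign(w)||^2. BiLLM's full scheme goes further.
import numpy as np

def binarize(w):
    alpha = np.abs(w).mean()          # closed-form optimal scale for sign(w)
    return alpha * np.sign(w)

def residual_binarize(w):
    b1 = binarize(w)                  # first 1-bit approximation
    b2 = binarize(w - b1)             # binarize the remaining error
    return b1 + b2

rng = np.random.default_rng(2)
w = rng.standard_normal(4096)
for name, fn in [("1-bit", binarize), ("1-bit + residual", residual_binarize)]:
    rel = np.linalg.norm(w - fn(w)) / np.linalg.norm(w)
    print(f"{name}: relative error {rel:.3f}")
```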
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory [6.367916611208411]
We propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity.
DDC-PIM yields about $2.84\times$ speedup on MobileNetV2 and $2.69\times$ on EfficientNet-B0 with negligible accuracy loss.
Compared with state-of-the-art macros, DDC-PIM achieves up to $8.41\times$ and $2.75\times$ improvement in weight density and area efficiency, respectively.
arXiv Detail & Related papers (2023-10-31T12:49:54Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
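The lookup-table idea itself is easy to demonstrate: with 2-bit weights and activations there are only 16 possible products, so every multiply becomes a table index. The codebook below is assumed for illustration, and DeepGEMM's SIMD shuffle implementation is not modeled:

```python
# Lookup-table trick behind ultra low-precision GEMM: 2-bit codes index a
# precomputed 4x4 product table instead of multiplying.
import numpy as np

LEVELS = np.array([-2, -1, 1, 2])        # assumed 2-bit codebook (illustrative)
LUT = LEVELS[:, None] * LEVELS[None, :]  # all 4 x 4 = 16 possible products

def lut_dot(w_codes, x_codes):
    """Dot product where every multiply is replaced by a table lookup."""
    return int(LUT[w_codes, x_codes].sum())

rng = np.random.default_rng(3)
w = rng.integers(0, 4, size=1024)        # 2-bit weight codes
x = rng.integers(0, 4, size=1024)        # 2-bit activation codes
assert lut_dot(w, x) == int((LEVELS[w] * LEVELS[x]).sum())
print("LUT-based dot product:", lut_dot(w, x))
```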
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules [8.224904698490626]
Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning accelerators.
We present a strategy using a deep reinforcement learning framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver.
Our evaluation of a production-scale model, BERT, on real hardware reveals that the partitioning generated using the RL policy achieves 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively.
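The propose-then-repair pattern can be sketched as follows, with a random policy standing in for the RL agent and a greedy rebalancing pass standing in for the constraint solver; the op sizes, chiplet count, and capacity are made up:

```python
# Sketch of "propose a possibly invalid partition, then repair it".
import random

def propose(num_ops, num_chips, rng):
    # Stand-in for the RL policy: may violate per-chiplet memory limits.
    return [rng.randrange(num_chips) for _ in range(num_ops)]

def repair(assign, sizes, capacity, num_chips):
    # Stand-in for the constraint solver: move ops off overloaded chiplets.
    load = [0] * num_chips
    for op, chip in enumerate(assign):
        load[chip] += sizes[op]
    for op, chip in enumerate(assign):
        if load[chip] > capacity:
            target = min(range(num_chips), key=load.__getitem__)
            load[chip] -= sizes[op]
            load[target] += sizes[op]
            assign[op] = target
    return assign, max(load) <= capacity

rng = random.Random(4)
sizes = [rng.randint(1, 10) for _ in range(40)]   # per-op memory footprint
assign = propose(len(sizes), num_chips=4, rng=rng)
assign, ok = repair(assign, sizes, capacity=sum(sizes) // 3, num_chips=4)
print("feasible after repair:", ok)
```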
arXiv Detail & Related papers (2021-12-07T23:40:28Z)
- OMPQ: Orthogonal Mixed Precision Quantization [72.63889596498004]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming formulation but is easy to optimize with linear programming.
This approach reduces the search time and the required data amount by orders of magnitude, with little compromise on quantization accuracy.
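A hedged sketch of the resulting allocation problem as a linear program: maximize an importance-weighted sum of per-layer bit widths under a model-size budget. The importance scores and layer sizes below are made-up stand-ins for the paper's orthogonality metric:

```python
# LP-based bit allocation sketch; importance scores are illustrative
# stand-ins for OMPQ's network-orthogonality proxy.
import numpy as np
from scipy.optimize import linprog

params = np.array([1e6, 2e6, 4e6, 2e6])      # weights per layer (assumed)
importance = np.array([0.9, 0.5, 0.2, 0.7])  # made-up proxy scores
budget = 4.0 * params.sum()                  # allow 4 bits per weight on average

# linprog minimizes, so negate the objective to maximize importance . bits
res = linprog(c=-importance,
              A_ub=[params], b_ub=[budget],
              bounds=[(2, 8)] * len(params), method="highs")
bits = np.floor(res.x).astype(int)           # round the LP relaxation down
print("per-layer bit widths:", bits)         # important layers get more bits
```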
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network [18.79036546647254]
We develop a novel ReRAM-based deep neural network (DNN) accelerator, named Sparse-Multiplication-Engine (SME).
First, we orchestrate the bit-sparse pattern to increase the density of bit sparsity on top of existing quantization methods.
Second, we propose a novel weight mapping mechanism to slice the bits of a weight across the crossbars and splice the activation results in the peripheral circuits.
Third, a squeeze-out scheme empties the crossbars that the previous two steps left mapped with highly sparse non-zeros.
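The bit-slicing step is straightforward to verify in software: decompose an integer weight matrix into 1-bit planes (one per crossbar), compute a partial MVM per plane, and splice the partials back with power-of-two shifts. The sketch below checks only the arithmetic, not SME's circuits:

```python
# Bit-slicing sketch: 1-bit weight planes stand in for crossbars, and
# shift-and-add models the splicing in the peripheral circuits.
import numpy as np

def bit_slice(w, bits=4):
    """Split unsigned integer weights into `bits` binary planes (LSB first)."""
    return [(w >> i) & 1 for i in range(bits)]

def sliced_mvm(x, w, bits=4):
    # One matmul per plane, then reassemble the full-precision result.
    return sum((x @ p) << i for i, p in enumerate(bit_slice(w, bits)))

rng = np.random.default_rng(5)
w = rng.integers(0, 16, size=(8, 8))   # 4-bit unsigned weights
x = rng.integers(0, 4, size=(1, 8))    # small integer input vector
assert np.array_equal(sliced_mvm(x, w), x @ w)
print("bit-sliced MVM matches the direct MVM")
```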
arXiv Detail & Related papers (2021-03-02T13:27:15Z)
- MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with Co-designed Compressed Neural Networks [0.6817102408452476]
Convolutional neural networks (CNNs) play a key role in deep learning applications.
The CIM architecture has demonstrated great potential for efficiently computing large-scale matrix-vector multiplications.
Network pruning and quantization are two widely studied compression methods for shrinking the model size and reducing computation costs.
arXiv Detail & Related papers (2020-10-24T10:31:49Z)
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization level, among 2, 4, and 8 bits, for each individual weight and activation tensor.
Given an MCU-class memory bound of 2 MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions.
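The memory bound is the easy part to make concrete: a candidate per-tensor bit assignment from {2, 4, 8} is admissible only if the packed weights fit in 2 MB. The tensor sizes below are illustrative, and random search stands in for the RL agent:

```python
# Feasibility check for an MCU weight-memory bound; sizes are assumed,
# and random search replaces the paper's RL agent.
import random

SIZES = [250_000, 500_000, 1_000_000, 2_000_000]  # weights per tensor (assumed)
BOUND = 2 * 1024 * 1024                           # 2 MB weight memory

def fits(bits):
    # Packed weight storage in bytes must stay within the MCU bound.
    return sum(n * b for n, b in zip(SIZES, bits)) / 8 <= BOUND

def search(steps=2000, seed=6):
    rng = random.Random(seed)
    best, best_avg = None, -1.0
    for _ in range(steps):
        cand = [rng.choice([2, 4, 8]) for _ in SIZES]
        avg = sum(cand) / len(cand)               # more bits ~ proxy for accuracy
        if fits(cand) and avg > best_avg:
            best, best_avg = cand, avg
    return best

print("admissible per-tensor bit widths:", search())
```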
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.