Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators
- URL: http://arxiv.org/abs/2601.21737v2
- Date: Fri, 30 Jan 2026 13:40:21 GMT
- Title: Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators
- Authors: Rebecca Pelke, Joel Klein, Jose Cubero-Cascante, Nils Bosbach, Jan Moritz Joseph, Rainer Leupers
- Abstract summary: We propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.
- Score: 0.8708298560474775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the crossbar inputs and cells are very limited, most CIM compilers do not support quantization below 8 bits. As a result, a single MVM requires many compute cycles, and weights cannot be efficiently stored in a single crossbar cell. To address this problem, we propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. This is why we introduce a reinforcement learning-based strategy to find suitable quantization configurations that balance latency and accuracy. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.
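To make the search concrete, here is a minimal, self-contained sketch of the kind of latency/accuracy trade-off such a framework explores over per-layer bit widths. The cost models, the per-layer sensitivities, and the epsilon-greedy loop are illustrative stand-ins, not the authors' reinforcement-learning implementation:

```python
# Illustrative sketch (not the paper's method): epsilon-greedy search over
# per-layer bit-width configurations, rewarding low latency while penalizing
# accuracy loss. All cost models below are toy stand-ins.
import random

LAYERS = 6
CHOICES = [2, 4, 8]                           # candidate bit widths per layer
SENS = [0.30, 0.02, 0.05, 0.01, 0.20, 0.03]   # made-up per-layer sensitivities

def latency(cfg):
    # Toy cost model: crossbar MVM cycles grow with the bit width of each layer.
    return sum(b * b for b in cfg)

def accuracy_loss(cfg):
    # Toy model: sensitive layers lose more accuracy when quantized below 8 bits.
    return sum(s * (8 - b) for s, b in zip(SENS, cfg))

def reward(cfg, lam=100.0):
    # Balance latency against accuracy loss, the trade-off the agent optimizes.
    return -latency(cfg) - lam * accuracy_loss(cfg)

def search(steps=500, eps=0.3, seed=0):
    rng = random.Random(seed)
    best = [8] * LAYERS                       # start from a safe all-8-bit config
    best_r = reward(best)
    for _ in range(steps):
        cand = best[:]
        i = rng.randrange(LAYERS)
        if rng.random() < eps:
            cand[i] = rng.choice(CHOICES)     # explore a random bit width
        else:
            cand[i] = max(2, cand[i] // 2)    # exploit: try halving the bit width
        r = reward(cand)
        if r > best_r:                        # keep strictly better configs
            best, best_r = cand, r
    return best

print("per-layer bit widths:", search())
```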
Related papers
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
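As a numerics-only illustration of the weight-only 4-bit setting such kernels accelerate, the sketch below quantizes weights per group and dequantizes on the fly before the matmul; it does not model MARLIN's fused GPU kernel, asynchronous memory access, or scheduling:

```python
# Numerics-only sketch of weight-only int4 quantization; the real MARLIN
# kernel fuses dequantization into a GPU GEMM, which is not modeled here.
import numpy as np

def quantize_int4(w, group=32):
    """Symmetric per-group int4 quantization: returns codes and scales."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale, shape):
    """Dequantize on the fly, then multiply: y = x @ W_hat."""
    w_hat = (q.astype(np.float32) * scale).reshape(shape)
    return x @ w_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((16, 256)).astype(np.float32)   # a batch of 16 tokens
q, s = quantize_int4(W)
err = np.abs(x @ W - dequant_matmul(x, q, s, W.shape)).mean()
print("mean |error| under int4 weight quantization:", err)
```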
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs).
We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise.
Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
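A hedged sketch of the group-wise allocation idea: rank weight groups by a salience score and give the most salient groups more bits under an average-bit budget. Mean absolute weight is used as a stand-in salience score here; SliM-LLM derives salience differently:

```python
# Sketch of salience-driven, group-wise bit allocation. The salience score
# (mean |w|) is an illustrative stand-in, not SliM-LLM's actual metric.
import numpy as np

def allocate_bits(w, group=64, budget=3.0, low=2, high=4):
    groups = w.reshape(-1, group)
    salience = np.abs(groups).mean(axis=1)     # stand-in salience score
    n = salience.size
    # How many groups can afford the high bit width under the average budget?
    k = int((budget - low) / (high - low) * n)
    bits = np.full(n, low)
    bits[np.argsort(salience)[-k:]] = high     # most salient groups get more bits
    return bits

rng = np.random.default_rng(1)
w = rng.standard_normal(64 * 256)
bits = allocate_bits(w)
print("average bits per weight:", bits.mean())  # hits the 3-bit budget
```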
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
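The core numerical step of 1-bit quantization fits in a few lines: binarize with the closed-form optimal scale alpha = mean(|w|), optionally binarizing the residual a second time. This is only the textbook building block; BiLLM's actual scheme (salient-weight selection and structured splitting) is more elaborate:

```python
# Textbook 1-bit binarization with the optimal scale; alpha = mean(|w|)
# minimizes ||w - alpha * sign(w)||^2. BiLLM's full scheme goes further.
import numpy as np

def binarize(w):
    alpha = np.abs(w).mean()          # closed-form optimal scale for sign(w)
    return alpha * np.sign(w)

def residual_binarize(w):
    b1 = binarize(w)                  # first 1-bit approximation
    b2 = binarize(w - b1)             # binarize the remaining error
    return b1 + b2

rng = np.random.default_rng(2)
w = rng.standard_normal(4096)
for name, fn in [("1-bit", binarize), ("1-bit + residual", residual_binarize)]:
    rel = np.linalg.norm(w - fn(w)) / np.linalg.norm(w)
    print(f"{name}: relative error {rel:.3f}")
```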
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory [6.367916611208411]
We propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity.
DDC-PIM yields about $2.84\times$ speedup on MobileNetV2 and $2.69\times$ on EfficientNet-B0 with negligible accuracy loss.
Compared with state-of-the-art macros, DDC-PIM achieves up to $8.41\times$ and $2.75\times$ improvement in weight density and area efficiency, respectively.
arXiv Detail & Related papers (2023-10-31T12:49:54Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
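The lookup-table idea itself is easy to demonstrate: with 2-bit weights and activations there are only 16 possible products, so every multiply becomes a table index. The codebook below is assumed for illustration, and DeepGEMM's SIMD shuffle implementation is not modeled:

```python
# Lookup-table trick behind ultra low-precision GEMM: 2-bit codes index a
# precomputed 4x4 product table instead of multiplying.
import numpy as np

LEVELS = np.array([-2, -1, 1, 2])        # assumed 2-bit codebook (illustrative)
LUT = LEVELS[:, None] * LEVELS[None, :]  # all 4 x 4 = 16 possible products

def lut_dot(w_codes, x_codes):
    """Dot product where every multiply is replaced by a table lookup."""
    return int(LUT[w_codes, x_codes].sum())

rng = np.random.default_rng(3)
w = rng.integers(0, 4, size=1024)        # 2-bit weight codes
x = rng.integers(0, 4, size=1024)        # 2-bit activation codes
assert lut_dot(w, x) == int((LEVELS[w] * LEVELS[x]).sum())
print("LUT-based dot product:", lut_dot(w, x))
```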
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules [8.224904698490626]
Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning accelerators.
We present a strategy using a deep reinforcement learning framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver.
Our evaluation of a production-scale model, BERT, on real hardware reveals that the partitioning generated using the RL policy achieves 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively.
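The propose-then-repair pattern can be sketched as follows, with a random policy standing in for the RL agent and a greedy rebalancing pass standing in for the constraint solver; the op sizes, chiplet count, and capacity are made up:

```python
# Sketch of "propose a possibly invalid partition, then repair it".
import random

def propose(num_ops, num_chips, rng):
    # Stand-in for the RL policy: may violate per-chiplet memory limits.
    return [rng.randrange(num_chips) for _ in range(num_ops)]

def repair(assign, sizes, capacity, num_chips):
    # Stand-in for the constraint solver: move ops off overloaded chiplets.
    load = [0] * num_chips
    for op, chip in enumerate(assign):
        load[chip] += sizes[op]
    for op, chip in enumerate(assign):
        if load[chip] > capacity:
            target = min(range(num_chips), key=load.__getitem__)
            load[chip] -= sizes[op]
            load[target] += sizes[op]
            assign[op] = target
    return assign, max(load) <= capacity

rng = random.Random(4)
sizes = [rng.randint(1, 10) for _ in range(40)]   # per-op memory footprint
assign = propose(len(sizes), num_chips=4, rng=rng)
assign, ok = repair(assign, sizes, capacity=sum(sizes) // 3, num_chips=4)
print("feasible after repair:", ok)
```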
arXiv Detail & Related papers (2021-12-07T23:40:28Z)
- OMPQ: Orthogonal Mixed Precision Quantization [72.63889596498004]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming formulation but is easy to optimize with linear programming.
This approach reduces the search time and the required data amount by orders of magnitude, with little compromise on quantization accuracy.
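A hedged sketch of the resulting allocation problem as a linear program: maximize an importance-weighted sum of per-layer bit widths under a model-size budget. The importance scores and layer sizes below are made-up stand-ins for the paper's orthogonality metric:

```python
# LP-based bit allocation sketch; importance scores are illustrative
# stand-ins for OMPQ's network-orthogonality proxy.
import numpy as np
from scipy.optimize import linprog

params = np.array([1e6, 2e6, 4e6, 2e6])      # weights per layer (assumed)
importance = np.array([0.9, 0.5, 0.2, 0.7])  # made-up proxy scores
budget = 4.0 * params.sum()                  # allow 4 bits per weight on average

# linprog minimizes, so negate the objective to maximize importance . bits
res = linprog(c=-importance,
              A_ub=[params], b_ub=[budget],
              bounds=[(2, 8)] * len(params), method="highs")
bits = np.floor(res.x).astype(int)           # round the LP relaxation down
print("per-layer bit widths:", bits)         # important layers get more bits
```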
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network [18.79036546647254]
We develop a novel ReRAM-based deep neural network (DNN) accelerator, named Sparse-Multiplication-Engine (SME).
First, we orchestrate the bit-sparse pattern to increase the density of bit sparsity on top of existing quantization methods.
Second, we propose a novel weight mapping mechanism to slice the bits of a weight across the crossbars and splice the activation results in the peripheral circuits.
Third, a squeeze-out scheme empties the crossbars that the previous two steps left mapped with highly sparse non-zeros.
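The bit-slicing step is straightforward to verify in software: decompose an integer weight matrix into 1-bit planes (one per crossbar), compute a partial MVM per plane, and splice the partials back with power-of-two shifts. The sketch below checks only the arithmetic, not SME's circuits:

```python
# Bit-slicing sketch: 1-bit weight planes stand in for crossbars, and
# shift-and-add models the splicing in the peripheral circuits.
import numpy as np

def bit_slice(w, bits=4):
    """Split unsigned integer weights into `bits` binary planes (LSB first)."""
    return [(w >> i) & 1 for i in range(bits)]

def sliced_mvm(x, w, bits=4):
    # One matmul per plane, then reassemble the full-precision result.
    return sum((x @ p) << i for i, p in enumerate(bit_slice(w, bits)))

rng = np.random.default_rng(5)
w = rng.integers(0, 16, size=(8, 8))   # 4-bit unsigned weights
x = rng.integers(0, 4, size=(1, 8))    # small integer input vector
assert np.array_equal(sliced_mvm(x, w), x @ w)
print("bit-sliced MVM matches the direct MVM")
```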
arXiv Detail & Related papers (2021-03-02T13:27:15Z)
- MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with Co-designed Compressed Neural Networks [0.6817102408452476]
Convolutional neural networks (CNNs) play a key role in deep learning applications.
The CIM architecture has demonstrated great potential for efficiently computing large-scale matrix-vector multiplications.
Network pruning and quantization are two widely studied compression methods for shrinking the model size and reducing computation costs.
arXiv Detail & Related papers (2020-10-24T10:31:49Z)
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization level, among 2, 4, and 8 bits, for each individual weight and activation tensor.
Given an MCU-class memory bound of 2 MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions.
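The memory bound is the easy part to make concrete: a candidate per-tensor bit assignment from {2, 4, 8} is admissible only if the packed weights fit in 2 MB. The tensor sizes below are illustrative, and random search stands in for the RL agent:

```python
# Feasibility check for an MCU weight-memory bound; sizes are assumed,
# and random search replaces the paper's RL agent.
import random

SIZES = [250_000, 500_000, 1_000_000, 2_000_000]  # weights per tensor (assumed)
BOUND = 2 * 1024 * 1024                           # 2 MB weight memory

def fits(bits):
    # Packed weight storage in bytes must stay within the MCU bound.
    return sum(n * b for n, b in zip(SIZES, bits)) / 8 <= BOUND

def search(steps=2000, seed=6):
    rng = random.Random(seed)
    best, best_avg = None, -1.0
    for _ in range(steps):
        cand = [rng.choice([2, 4, 8]) for _ in SIZES]
        avg = sum(cand) / len(cand)               # more bits ~ proxy for accuracy
        if fits(cand) and avg > best_avg:
            best, best_avg = cand, avg
    return best

print("admissible per-tensor bit widths:", search())
```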
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.