Related papers: DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models

DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models

URL: http://arxiv.org/abs/2601.06724v1
Date: Sat, 10 Jan 2026 23:56:33 GMT
Title: DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models
Authors: Kunming Shao, Liang Zhao, Jiangnan Yu, Zhipeng Liao, Xiaomeng Wang, Yi Zou, Tim Kwang-Ting Cheng, Chi-Ying Tsui,
Abstract summary: This paper introduces a digital CIM (DS-CIM) architecture that achieves both high accuracy and efficiency.<n>We implement multiply-accumulation (MAC) in a compact, unsigned OR-based circuit by modifying the data representation.<n>Our core strategy, a shared Random Number Generator (PRNG) with 2D, enables single-cycle mutually exclusive activation to eliminate OR-gate collisions.
Score: 8.92683306412944
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Stochastic computing (SC) offers hardware simplicity but suffers from low throughput, while high-throughput Digital Computing-in-Memory (DCIM) is bottlenecked by costly adder logic for matrix-vector multiplication (MVM). To address this trade-off, this paper introduces a digital stochastic CIM (DS-CIM) architecture that achieves both high accuracy and efficiency. We implement signed multiply-accumulation (MAC) in a compact, unsigned OR-based circuit by modifying the data representation. Throughput is enhanced by replicating this low-cost circuit 64 times with only a 1x area increase. Our core strategy, a shared Pseudo Random Number Generator (PRNG) with 2D partitioning, enables single-cycle mutually exclusive activation to eliminate OR-gate collisions. We also resolve the 1s saturation issue via stochastic process analysis and data remapping, significantly improving accuracy and resilience to input sparsity. Our high-accuracy DS-CIM1 variant achieves 94.45% accuracy for INT8 ResNet18 on CIFAR-10 with a root-mean-squared error (RMSE) of just 0.74%. Meanwhile, our high-efficiency DS-CIM2 variant attains an energy efficiency of 3566.1 TOPS/W and an area efficiency of 363.7 TOPS/mm^2, while maintaining a low RMSE of 3.81%. The DS-CIM capability with larger models is further demonstrated through experiments with INT8 ResNet50 on ImageNet and the FP8 LLaMA-7B model.

Related papers

SPADE: A SIMD Posit-enabled compute engine for Accelerating DNN Efficiency [0.12314765641075437]
This work presents SPADE, a unified multi-precision SIMD Posit-based multiplyaccumulate (MAC) architecture.<n>Unlike prior single-precision or floating/fixed-point SIMD MACs, SPADE introduces a regime-aware, lane-fused SIMD Posit datapath.<n> FPGA implementation on a Xilinx Virtex-7 shows 45.13% LUT and 80% slice reduction for Posit (8,0), and up to 28.44% and 17.47% improvement for Posit (16,1) and Posit (32,2) over prior work.
arXiv Detail & Related papers (2026-01-24T03:38:11Z)
MeanFlow Transformers with Representation Autoencoders [71.45823902973349]
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data.<n>We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE)<n>We achieve a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
arXiv Detail & Related papers (2025-11-17T06:17:08Z)
Flow-GRPO: Training Flow Matching Models via Online RL [80.62659379624867]
We propose Flow-GRPO, the first method to integrate online policy reinforcement learning into flow matching models.<n>Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation into an equivalent Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps.
arXiv Detail & Related papers (2025-05-08T17:58:45Z)
Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators [7.728820930581886]
CIM is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead.<n>Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy.<n>This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level.
arXiv Detail & Related papers (2025-02-11T05:32:14Z)
IMAGINE: An 8-to-1b 22nm FD-SOI Compute-In-Memory CNN Accelerator With an End-to-End Analog Charge-Based 0.15-8POPS/W Macro Featuring Distribution-Aware Data Reshaping [0.6071203743728119]
We present IMAGINE, a workload-adaptive 1-to-8b CIM-CNN accelerator in 22nm FD-SOI.<n>It introduces a 1152x256 end-to-end charge-based macro with a multi-bit DP based on an input-serial, weight-parallel accumulation that avoids power-hungry DACs.<n>Measurement results showcase an 8b system-level energy efficiency of 40TOPS/W at 0.3/0.6V, with competitive accuracies on MNIST and CIFAR-10.
arXiv Detail & Related papers (2024-12-27T17:18:15Z)
Synergistic Development of Perovskite Memristors and Algorithms for Robust Analog Computing [53.77822620185878]
We propose a synergistic methodology to concurrently optimize perovskite memristor fabrication and develop robust analog DNNs.<n>We develop "BayesMulti", a training strategy utilizing BO-guided noise injection to improve the resistance of analog DNNs to memristor imperfections.<n>Our integrated approach enables use of analog computing in much deeper and wider networks, achieving up to 100-fold improvements.
arXiv Detail & Related papers (2024-12-03T19:20:08Z)
LCM: Locally Constrained Compact Point Cloud Model for Masked Point Modeling [47.94285833315427]
We propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder. Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency. This decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density.
arXiv Detail & Related papers (2024-05-27T13:19:23Z)
SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models. SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z)
Pruning random resistive memory for optimizing analogue AI [54.21621702814583]
AI models present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning.
arXiv Detail & Related papers (2023-11-13T08:59:01Z)
AnalogNets: ML-HW Co-Design of Noise-robust TinyML Models and Always-On Analog Compute-in-Memory Accelerator [50.31646817567764]
This work describes TinyML models for the popular always-on applications of keyword spotting (KWS) and visual wake words (VWW) We detail a comprehensive training methodology, to retain accuracy in the face of analog non-idealities. We also describe AON-CiM, a programmable, minimal-area phase-change memory (PCM) analog CiM accelerator.
arXiv Detail & Related papers (2021-11-10T10:24:46Z)
SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable Accuracy [3.4154033825543055]
This paper presents for the first time, an SIMD architecture based on novel multiplier and divider with tunable accuracy. The proposed hybrid architecture implements Mitchell's algorithms and supports precision variability from 8 to 32 bits.
arXiv Detail & Related papers (2020-11-02T17:40:44Z)
An Accurate EEGNet-based Motor-Imagery Brain-Computer Interface for Low-Power Edge Computing [13.266626571886354]
This paper presents an accurate and robust embedded motor-imagery brain-computer interface (MI-BCI) The proposed novel model, based on EEGNet, matches the requirements of memory footprint and computational resources of low-power microcontroller units (MCUs) The scaled models are deployed on a commercial Cortex-M4F MCU taking 101ms and consuming 4.28mJ per inference for operating the smallest model, and on a Cortex-M7 with 44ms and 18.1mJ per inference for the medium-sized model.
arXiv Detail & Related papers (2020-03-31T19:52:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.