OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads
- URL: http://arxiv.org/abs/2508.08822v1
- Date: Tue, 12 Aug 2025 10:24:33 GMT
- Title: OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads
- Authors: Shady Agwa, Yihan Pan, Georgios Papandroulidakis, Themis Prodromakis
- Abstract summary: OISMA is a novel in-memory computing architecture that utilizes the computational simplicity of a quasi-stochastic computing domain (Bent-Pyramid system). OISMA converts normal memory read operations into in-situ multiplication operations with a negligible cost. The accuracy results show a significant decrease in the average relative Frobenius error, from 9.42% (for 4x4) to 1.81% (for 512x512).
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Artificial Intelligence models are rapidly growing in complexity, with massive matrix multiplication workloads representing the major computational bottleneck. In-memory computing architectures have been proposed to avoid the Von Neumann bottleneck. However, both digital/binary-based and analogue in-memory computing architectures suffer from various limitations that significantly degrade their performance and energy-efficiency gains. This work proposes OISMA, a novel in-memory computing architecture that utilizes the computational simplicity of a quasi-stochastic computing domain (the Bent-Pyramid system) while retaining the efficiency, scalability, and productivity of digital memories. OISMA converts normal memory read operations into in-situ stochastic multiplication operations at negligible cost. An accumulation periphery then accumulates the output multiplication bitstreams, realizing the matrix multiplication functionality. Extensive matrix multiplication benchmarking was conducted to analyze the accuracy of the Bent-Pyramid system, using matrix dimensions ranging from 4x4 to 512x512. Compared to the 64-bit double-precision floating-point format, the average relative Frobenius error decreases significantly, from 9.42% (for 4x4) to 1.81% (for 512x512). A 1T1R OISMA array of 4 KB capacity was implemented using a commercial 180nm technology node and in-house RRAM technology. At 50 MHz, OISMA achieves 0.891 TOPS/W energy efficiency and 3.98 GOPS/mm2 area efficiency, occupying an effective computing area of 0.804241 mm2. Scaling OISMA from 180nm to 22nm shows a significant improvement of two orders of magnitude in energy efficiency and one order of magnitude in area efficiency, compared to dense matrix multiplication in-memory computing architectures.
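For intuition, the sketch below pairs the two ingredients the abstract relies on: bitstream-based stochastic multiplication and the relative Frobenius error used as the accuracy metric. The Bent-Pyramid encoding is not specified here, so classic unipolar stochastic computing (Bernoulli bitstreams combined with a bitwise AND) stands in as an assumption; stream length, matrix sizes, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sc_multiply(a, b, length=1024):
    """Unipolar stochastic multiply of a, b in [0, 1]: each value becomes
    a Bernoulli bitstream whose mean equals the value, the AND of two
    independent streams has mean a*b, and counting ones estimates the
    product. A stand-in for OISMA's Bent-Pyramid read-based multiply."""
    bits_a = rng.random(length) < a
    bits_b = rng.random(length) < b
    return np.count_nonzero(bits_a & bits_b) / length

def rel_frobenius_error(approx, exact):
    """Relative Frobenius error, the paper's accuracy metric."""
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

# Every scalar product in the matmul goes through the stochastic path,
# and the partial products are accumulated, as OISMA's periphery does.
n = 4
A, B = rng.random((n, n)), rng.random((n, n))
C_sc = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        C_sc[i, j] = sum(sc_multiply(A[i, k], B[k, j]) for k in range(n))

print(f"relative Frobenius error: {rel_frobenius_error(C_sc, A @ B):.2%}")
```

Averaging more independent partial products concentrates the estimate, which is consistent with the error trend reported above (9.42% at 4x4 down to 1.81% at 512x512).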
Related papers
- POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation [57.57816409869894]
We introduce POET-X, a scalable and memory-efficient variant of POET for training large language models. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency.
arXiv Detail & Related papers (2026-03-05T18:59:23Z)
- Implementation of high-efficiency, lightweight residual spiking neural network processor based on field-programmable gate arrays [0.49806798459446283]
This work presents an efficient residual SNN accelerator that combines algorithm and hardware co-design to optimize inference energy efficiency. The proposed processor achieves a classification accuracy of 87.11% on the CIFAR-10 dataset, with an inference time of 3.98 ms per image and an energy efficiency of 183.5 FPS/W.
arXiv Detail & Related papers (2025-12-09T02:08:46Z)
- ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization [99.96330641363396]
ARMOR (Adaptive Representation with Matrix-factORization) is a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. We demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations.
arXiv Detail & Related papers (2025-10-07T02:39:20Z)
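As background for the 2:4 sparse core mentioned above, here is a minimal sketch of the 2:4 semi-structured pattern itself: in every group of four consecutive weights, keep the two largest magnitudes and zero the rest. This illustrates only the sparsity pattern, not ARMOR's adaptive factorization or block-diagonal wrappers; the function name is hypothetical.

```python
import numpy as np

def prune_2_4(w):
    """Zero the 2 smallest-magnitude weights in every group of 4 along
    the flattened weight matrix (the 2:4 pattern that sparse tensor
    cores accelerate). Not ARMOR itself, just its target structure."""
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.default_rng(0).normal(size=(4, 8))
print(prune_2_4(w))  # every group of 4 contains exactly 2 zeros
```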
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.45368843861917]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
- Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
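A minimal sketch of the weight-centric vs. input-centric distinction described above, using a toy orthogonal factor from a QR decomposition as a stand-in for OFT's learned rotation; both paths produce the same output, but the input-centric one never materializes the rotated weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
W = rng.normal(size=(d, d))                    # frozen pretrained weight
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # toy orthogonal factor
x = rng.normal(size=d)

# Weight-centric: materialize Q @ W with a matrix-matrix product, O(d^3).
y_weight_centric = (Q @ W) @ x

# Input-centric (matrix-free): two matrix-vector products, O(d^2).
y_input_centric = Q @ (W @ x)

assert np.allclose(y_weight_centric, y_input_centric)
```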
- Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions. We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z)
- The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference [0.9954176833299684]
Deep learning (DL) has led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats. This paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer arithmetic.
arXiv Detail & Related papers (2025-06-13T12:40:16Z)
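A minimal sketch of the mixed-precision integer pattern the paper studies: int8 operands, int32 accumulation to avoid overflow, and a single dequantization at the end. The per-tensor absmax scaling and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(x):
    """Symmetric per-tensor absmax quantization to int8."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

A = rng.normal(size=(64, 128)).astype(np.float32)
B = rng.normal(size=(128, 32)).astype(np.float32)
qA, sA = quantize_int8(A)
qB, sB = quantize_int8(B)

# Integer GEMM: widen to int32 before accumulating, dequantize once.
C = (qA.astype(np.int32) @ qB.astype(np.int32)).astype(np.float32) * (sA * sB)
print("max abs error vs FP32:", np.abs(C - A @ B).max())
```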
- BitNet b1.58 2B4T Technical Report [118.78752947128682]
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability.
arXiv Detail & Related papers (2025-04-16T17:51:43Z)
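For reference, a minimal sketch of absmean ternary ("1.58-bit") weight quantization in the spirit of the BitNet line of work: scale each weight matrix by its mean absolute value, then round and clip to {-1, 0, +1}. Training-time details of b1.58 2B4T are not reproduced, and the helper name is hypothetical.

```python
import numpy as np

def absmean_ternary(w, eps=1e-8):
    """Quantize weights to the ternary set {-1, 0, +1} with a single
    per-matrix scale gamma = mean(|w|), a simplified sketch of the
    1.58-bit scheme used by BitNet-style models."""
    gamma = np.abs(w).mean() + eps
    q = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
    return q, gamma  # reconstruct with q * gamma

w = np.random.default_rng(0).normal(size=(4, 4))
q, gamma = absmean_ternary(w)
print(q)  # entries drawn only from {-1, 0, 1}
```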
- Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs [8.17483100683993]
We introduce a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of multiply-accumulators (MACs). We propose four optimization techniques that improve timing, area, and power consumption. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively.
arXiv Detail & Related papers (2025-03-08T21:21:23Z)
- DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference [4.718504401468233]
PIM solutions rely either on novel memory technologies that have yet to mature or on bit-serial computations that incur significant performance overhead and scalability issues.
Our work proposes an in-SRAM digital multiplier that uses a conventional memory to perform bit-parallel computations by activating multiple wordlines.
We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to the SOTA counterparts, with competitive energy efficiency.
arXiv Detail & Related papers (2023-05-12T10:58:21Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
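A simplified sketch of the vector-wise Int8 scheme at the core of LLM.int8(): one absmax scale per row of the activations and per column of the weights, int32 accumulation, and dequantization via the outer product of the scales. The paper's mixed-precision decomposition, which routes outlier feature dimensions through a small fp16 matmul, is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64)).astype(np.float32)   # activations
W = rng.normal(size=(64, 16)).astype(np.float32)  # weights

sx = np.abs(X).max(axis=1, keepdims=True) / 127.0  # per-row scales
sw = np.abs(W).max(axis=0, keepdims=True) / 127.0  # per-column scales
qX = np.round(X / sx).astype(np.int8)
qW = np.round(W / sw).astype(np.int8)

# Int32 accumulation, then dequantize with the outer product of scales.
Y = (qX.astype(np.int32) @ qW.astype(np.int32)).astype(np.float32) * (sx @ sw)
print("max abs error vs FP32:", np.abs(Y - X @ W).max())
```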
- IMAC: In-memory multi-bit Multiplication and ACcumulation in 6T SRAM Array [5.29958909018578]
In-memory computing aims at embedding some aspects of computations inside the memory array.
We present a novel in-memory multiplication followed by accumulation operation capable of performing parallel dot products within a 6T array.
The proposed system achieves 6.24x lower energy consumption and 9.42x lower delay.
arXiv Detail & Related papers (2020-03-27T17:43:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.