CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory
Architectures
- URL: http://arxiv.org/abs/2401.07671v2
- Date: Wed, 17 Jan 2024 13:49:05 GMT
- Title: CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory
Architectures
- Authors: Rebecca Pelke, Jose Cubero-Cascante, Nils Bosbach, Felix Staudigl,
Rainer Leupers, Jan Moritz Joseph
- Abstract summary: We present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures.
We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms.
- Score: 0.1747623282473278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The demand for efficient machine learning (ML) accelerators is growing
rapidly, driving the development of novel computing concepts such as resistive
random access memory (RRAM)-based tiled computing-in-memory (CIM)
architectures. CIM enables computation directly within the memory unit, resulting in
faster data processing and reduced power consumption. Efficient compiler
algorithms are essential to exploit the potential of tiled CIM architectures.
While conventional ML compilers focus on code generation for CPUs, GPUs, and
other von Neumann architectures, adaptations are needed to cover CIM
architectures. Cross-layer scheduling is a promising approach, as it enhances
the utilization of CIM cores, thereby accelerating computations. Although
similar concepts are implicitly used in previous work, there is a lack of clear
and quantifiable algorithmic definitions for cross-layer scheduling for tiled
CIM architectures. To close this gap, we present CLSA-CIM, a cross-layer
scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with
existing weight-mapping strategies and compare performance against
state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves utilization
by up to 17.9x, resulting in an overall speedup of up to 29.2x compared to
SOTA.
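A minimal sketch of the core idea, assuming a toy cost model (every layer pinned to its own tiles, one output chunk produced per time step, ideal dataflow); this is an illustration of why cross-layer scheduling raises tile utilization, not the CLSA-CIM algorithm itself:

```python
# Toy model (hypothetical, not the authors' implementation): each layer is
# pinned to its own tile set (weights stay resident in the CIM tiles) and
# emits its output in fixed-size chunks, one chunk per time step.

def layer_by_layer(chunks_per_layer: int, n_layers: int) -> int:
    """Layer l+1 may only start after layer l has produced ALL chunks."""
    return n_layers * chunks_per_layer

def cross_layer(chunks_per_layer: int, n_layers: int) -> int:
    """Layer l+1 starts as soon as layer l emits its FIRST chunk, so the
    layers form a pipeline: fill time plus the chunk stream."""
    return (n_layers - 1) + chunks_per_layer

if __name__ == "__main__":
    chunks, layers = 64, 8
    work = chunks * layers  # total tile-steps of useful work
    for name, steps in [("layer-by-layer", layer_by_layer(chunks, layers)),
                        ("cross-layer", cross_layer(chunks, layers))]:
        print(f"{name}: {steps} steps, "
              f"utilization {work / (steps * layers):.2f}")
```

Under these toy assumptions, 8 layers of 64 chunks finish in 71 steps instead of 512; the numbers only illustrate the direction of the effect that CLSA-CIM quantifies on real architectures.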
Related papers
- PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation [1.2848824355101671]
This paper introduces a novel probabilistic approximate computation (PAC) method that reduces approximation error by 4X compared to existing approaches.
PAC enables efficient sparsity-based computation in compute-in-memory (CiM) systems by simplifying complex MAC vector computations into scalar calculations.
We develop PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS.
arXiv Detail & Related papers (2024-08-29T03:58:19Z)
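The summary above leaves the PAC mechanism open; as a loose stand-in for probabilistic approximation of MAC operations, the sketch below estimates a dot product from a random sample of its terms. The estimator is a generic Monte Carlo assumption, not PACiM's actual method:

```python
import numpy as np

# Hedged stand-in for probabilistic approximate computation: estimate a
# MAC (dot product) from a random subset of its terms. Generic Monte Carlo
# estimator for illustration only, not the PAC method from PACiM.

def approx_mac(x: np.ndarray, w: np.ndarray, n_samples: int, rng) -> float:
    """Unbiased estimate of dot(x, w) using n_samples of the n products."""
    n = x.size
    idx = rng.choice(n, size=n_samples, replace=False)
    # Scale the partial sum so the estimator is unbiased.
    return (n / n_samples) * float(np.dot(x[idx], w[idx]))

rng = np.random.default_rng(0)
x, w = rng.standard_normal(1024), rng.standard_normal(1024)
print(f"exact {np.dot(x, w):.2f}  approx {approx_mac(x, w, 256, rng):.2f}")
```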
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
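A minimal sketch of the selection idea behind SPARSEK, assuming plain query-key scores and a hard top-k in place of the paper's scoring network and differentiable top-k mask operator:

```python
import numpy as np

# Hedged sketch of top-k sparse attention in the spirit of SPARSEK: every
# query attends to a constant number k of key/value pairs. SPARSEK uses a
# scoring network and a differentiable top-k mask operator; plain query-key
# scores and a hard top-k stand in for both here (an assumption).

def topk_sparse_attention(Q, K, V, k):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (n_q, n_kv)
    # k-th largest score per query row acts as the keep/drop threshold
    # (ties may keep slightly more than k entries).
    thresh = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over top-k
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, 32)) for n in (4, 128, 128))
print(topk_sparse_attention(Q, K, V, k=16).shape)     # (4, 32)
```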
- Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives.
We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis.
We find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
- CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators [10.756046653406296]
We propose CIM-MLC, a universal multi-level compilation framework for general CIM architectures.
CIM-MLC can explore the mapping and scheduling strategies across multiple architectural tiers.
arXiv Detail & Related papers (2024-01-23T01:33:09Z)
- Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z)
- DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory [6.367916611208411]
We propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity.
DDC-PIM yields about $2.84\times$ speedup on MobileNetV2 and $2.69\times$ on EfficientNet-B0 with negligible accuracy loss.
Compared with state-of-the-art macros, DDC-PIM achieves up to $8.41\times$ and $2.75\times$ improvement in weight density and area efficiency, respectively.
arXiv Detail & Related papers (2023-10-31T12:49:54Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
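To illustrate the random-feature trick RFAD relies on, the sketch below uses the classic random Fourier feature map for an RBF kernel as a stand-in for the NNGP kernel approximation (an assumption for simplicity):

```python
import numpy as np

# Hedged illustration: RFAD approximates the NNGP kernel with random
# features; this classic random Fourier feature map for an RBF kernel is a
# simpler stand-in. It shows the trick being exploited: replace exact
# kernel evaluations with an explicit randomized feature map.

def rff(X, n_features, gamma, rng):
    """Features z with z(x) @ z(y) ~= exp(-gamma * ||x - y||^2)."""
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = rff(X, n_features=4096, gamma=0.5, rng=rng)
exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(Z @ Z.T - exact).max())  # small approximation error
```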
- An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System [9.429605859159023]
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems, with processing-in-memory capabilities, can alleviate this data movement bottleneck.
We implement several representative classic ML algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-07-16T09:39:53Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with Co-designed Compressed Neural Networks [0.6817102408452476]
Convolutional neural networks (CNNs) play a key role in deep learning applications.
CIM architectures have demonstrated great potential for efficiently computing large-scale matrix-vector multiplications.
Network pruning and quantization are two widely studied compression methods for shrinking the model size and reducing computation costs.
arXiv Detail & Related papers (2020-10-24T10:31:49Z)
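For reference, hedged textbook versions of the two compression methods the summary names, magnitude pruning and uniform symmetric quantization; these generic sketches are not MARS's co-design procedure:

```python
import numpy as np

# Hedged, textbook sketches of magnitude pruning and uniform symmetric
# quantization. Generic illustrations, not MARS's co-designed compression.

def prune(W, sparsity):
    """Zero out (approximately) the smallest-magnitude fraction of weights."""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    return np.where(np.abs(W) < thresh, 0.0, W)

def quantize(W, bits=8):
    """Uniform symmetric quantization to signed `bits`-bit integers."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_sparse = prune(W, sparsity=0.8)
W_q, scale = quantize(W_sparse)
print(f"sparsity {(W_sparse == 0).mean():.2f}, "
      f"max quant error {np.abs(W_q * scale - W_sparse).max():.4f}")
```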
- One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is demonstrated through simulations of Boston house-price prediction and the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
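What "one computational step" means here can be emulated digitally: the crosspoint array physically produces the closed-form least-squares solution, which the hedged sketch below computes via the normal equations:

```python
import numpy as np

# Hedged digital emulation of "one-step" learning: a crosspoint resistive
# array can produce the least-squares solution of a linear regression in a
# single analog operation. Here the same closed-form answer is computed
# numerically to show what the array delivers in one step.

rng = np.random.default_rng(0)
X = np.hstack([rng.standard_normal((100, 3)), np.ones((100, 1))])  # bias col
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(100)

# One "step": solve the normal equations  (X^T X) w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to w_true
```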
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.