Computing-In-Memory Dataflow for Minimal Buffer Traffic
- URL: http://arxiv.org/abs/2508.14375v1
- Date: Wed, 20 Aug 2025 03:05:40 GMT
- Title: Computing-In-Memory Dataflow for Minimal Buffer Traffic
- Authors: Choongseok Song, Doo Seok Jeong
- Abstract summary: We introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4-87.0%, leading to a total reduction in data traffic energy and latency by 10.1-17.9% and 15.6-27.8%, respectively.
- Score: 1.9234504183934982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computing-In-Memory (CIM) offers a potential solution to the memory wall issue and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models like MobileNet and EfficientNet, which utilize depthwise convolution for feature extraction, have been developed for these devices. However, CIM macros often face challenges in accelerating depthwise convolution, including underutilization of CIM memory and heavy buffer traffic. The latter, in particular, has been overlooked despite its significant impact on latency and energy consumption. To address this, we introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization during depthwise convolution. The proposed dataflow is grounded in solid theoretical principles, which are fully demonstrated in this paper. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4-87.0%, leading to a total reduction in data traffic energy and latency by 10.1-17.9% and 15.6-27.8%, respectively, compared to the baseline (conventional weight-stationary dataflow).
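To make the buffer-traffic problem concrete, the NumPy sketch below illustrates why a weight-stationary dataflow is wasteful for depthwise convolution: each output pixel re-fetches its full KxK input window from the activation buffer. This is only an illustration of the problem the paper targets, not the paper's dataflow; the function name and tensor shapes are hypothetical.

```python
import numpy as np

def depthwise_conv_ws(x, w):
    """Depthwise convolution (stride 1, no padding) under a naive
    weight-stationary dataflow: weights stay resident in the CIM macro
    while every output pixel re-reads its full KxK input window from
    the buffer. Returns the output and the number of buffer reads."""
    C, H, W = x.shape          # channels, height, width
    _, K, _ = w.shape          # one KxK kernel per channel
    Ho, Wo = H - K + 1, W - K + 1
    y = np.zeros((C, Ho, Wo))
    buffer_reads = 0
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                window = x[c, i:i + K, j:j + K]
                buffer_reads += K * K          # window re-fetched every time
                y[c, i, j] = np.sum(window * w[c])
    return y, buffer_reads

x = np.random.rand(32, 16, 16)   # hypothetical activation tensor
w = np.random.rand(32, 3, 3)     # 3x3 depthwise kernels
y, reads = depthwise_conv_ws(x, w)
# Each input pixel is read up to K*K = 9 times; a reuse-oriented
# dataflow would fetch it closer to once.
print(f"buffer reads: {reads}, unique input values: {x.size}")
```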
Related papers
- MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [5.88896081401217]
We introduce MEADOW, a framework that significantly reduces the off-chip memory access for large language models. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation. MEADOW achieves an end-to-end latency improvement of over 40% compared to prior LLM optimization works.
arXiv Detail & Related papers (2025-02-14T23:50:37Z)
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
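The core loop (learn a differentiable cost predictor, then move the dataflow code along its gradients instead of brute-force searching) can be sketched in a few lines of NumPy. The encoding, features, and cost function below are toy stand-ins, not DCP's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the expensive hardware cost model
# (e.g. latency of a dataflow encoding) that search should avoid.
def true_cost(code):
    return np.sum((code - 0.7) ** 2)

# Offline samples of (dataflow code, measured cost).
codes = rng.random((256, 8))              # 8-dim dataflow encodings
costs = np.array([true_cost(c) for c in codes])

# Fit a differentiable surrogate predictor on quadratic features.
feats = np.hstack([codes, codes ** 2, np.ones((256, 1))])
w = np.linalg.lstsq(feats, costs, rcond=None)[0]

def predictor_grad(code):
    # Gradient of the predicted cost with respect to the dataflow code.
    return w[:8] + 2 * w[8:16] * code

# Descend the code along the predictor's gradient, never calling the
# expensive true cost during the search itself.
code = rng.random(8)
for _ in range(200):
    code -= 0.05 * predictor_grad(code)
print("code:", np.round(code, 2), "true cost:", round(true_cost(code), 4))
```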
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
- TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling [0.0]
Convolutional Neural Networks (CNNs) are susceptible to the Von Neumann bottleneck. This bottleneck relates to the energy cost of moving data between the processing cores and the memory. In this paper, we propose TrIM: a novel dataflow for systolic arrays (SAs) based on a Triangular Input Movement and compatible with CNN computing.
arXiv Detail & Related papers (2024-08-02T13:15:17Z)
- Full-Stack Optimization for CAM-Only DNN Inference [2.0837295518447934]
This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors.
We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity.
Our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators.
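For reference, the ternary-weight setting means quantizing weights to {-1, 0, +1} with a scale factor, so multiplications degenerate into additions and subtractions that map well to associative processors. The sketch below is one common ternarization recipe (threshold band plus an L2-optimal scale), not necessarily the paper's own pipeline; the threshold choice is an assumption.

```python
import numpy as np

def ternarize(w, sparsity_threshold=0.05):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale.
    Values inside the threshold band collapse to zero, which lowers
    arithmetic intensity and creates exploitable sparsity."""
    delta = sparsity_threshold * np.max(np.abs(w))
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    # Scale minimizing L2 error over the nonzero entries:
    # alpha* = mean(|w|) over positions kept as +/-1.
    nz = t != 0
    alpha = np.abs(w[nz]).mean() if nz.any() else 0.0
    return t, alpha

w = np.random.randn(64, 64)
t, alpha = ternarize(w)
w_hat = alpha * t                      # dequantized approximation
print("zero fraction:", (t == 0).mean(), "mse:", np.mean((w - w_hat) ** 2))
```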
arXiv Detail & Related papers (2024-01-23T10:27:38Z)
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
- RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge [1.8293684411977293]
Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit sparsity to reduce area (storage), power, and latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z)
- Improving information retention in large scale online continual learning [99.73847522194549]
Online continual learning (OCL) aims to adapt efficiently to new data while retaining existing knowledge.
Recent work suggests that information retention remains a problem in large scale OCL even when the replay buffer is unlimited.
We propose using a moving average family of methods to improve optimization for non-stationary objectives.
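One member of the moving-average family is an exponential moving average (EMA) of model parameters: the raw weights keep training on the non-stationary stream while evaluation uses the smoothed copy. The minimal sketch below uses hypothetical names and is not the paper's exact method.

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model parameters. The training
    loop updates the raw weights; evaluation uses the shadow copy,
    which is more stable under a non-stationary data stream."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p

# Toy usage: 'params' stands in for a model's weight tensors.
params = [np.zeros(4)]
ema = EMAWeights(params, decay=0.9)
for step in range(100):
    params[0] += 0.1 * np.random.randn(4)   # hypothetical SGD update
    ema.update(params)
print("raw:", params[0], "ema:", ema.shadow[0])
```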
arXiv Detail & Related papers (2022-10-12T16:59:43Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
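To see why patch-by-patch scheduling cuts peak memory, the simplified NumPy sketch below runs an early downsampling stage on two overlapping input bands so that only one band's activations are live at a time. It illustrates the general idea, not the MCUNetV2 scheduler; the stage and shapes are hypothetical.

```python
import numpy as np

def first_stage(x):
    """Stand-in for a network's memory-hungry early layers: a 3x3
    average pool with stride 2 and no padding."""
    C, H, W = x.shape
    Ho, Wo = (H - 3) // 2 + 1, (W - 3) // 2 + 1
    y = np.zeros((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            y[:, i, j] = x[:, 2*i:2*i+3, 2*j:2*j+3].mean(axis=(1, 2))
    return y

x = np.random.rand(3, 96, 96)

# Per-layer inference: the full input and output are live together.
y_whole = first_stage(x)
peak_whole = x.size + y_whole.size

# Patch-based inference: compute the output in two row bands, each
# from its own input slice (slices overlap by the receptive field),
# so only one band's input activations are live at a time.
top = first_stage(x[:, 0:49, :])      # output rows 0..23
bot = first_stage(x[:, 48:95, :])     # output rows 24..46
peak_patch = max(x[:, 0:49, :].size + top.size,
                 x[:, 48:95, :].size + bot.size)
assert np.allclose(np.concatenate([top, bot], axis=1), y_whole)
print(f"peak activations, whole: {peak_whole}, patched: {peak_patch}")
```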
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- ATTACC the Quadratic Bottleneck of Attention Layers [3.2741800634280245]
This paper introduces a new attention-tailored dataflow, termed FLAT, for deep neural network (DNN) accelerators.
It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% of energy reduction compared to state-of-the-art edge and cloud accelerators.
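The quadratic bottleneck refers to the L x L score matrix that grows with sequence length L. A common mitigation, sketched below in plain NumPy, processes queries in blocks so only a block x L tile of scores is ever materialized; this illustrates the general fusion idea rather than FLAT's specific dataflow.

```python
import numpy as np

def attention_blocked(Q, K, V, block=64):
    """Single-head attention computed one query block at a time, so
    peak score memory is block*L rather than the full L*L matrix."""
    L, d = Q.shape
    out = np.empty_like(V)
    for s in range(0, L, block):
        scores = Q[s:s + block] @ K.T / np.sqrt(d)     # (block, L) tile
        scores -= scores.max(axis=1, keepdims=True)    # stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        out[s:s + block] = probs @ V
    return out

L, d = 512, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))

# Reference: unfused attention materializing the full L x L matrix.
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert np.allclose(attention_blocked(Q, K, V), P @ V)
```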
arXiv Detail & Related papers (2021-07-13T22:23:40Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features.
We build an extremely light-weight model, namely CSNet, which achieves performance comparable to large models with only about 0.2% of their parameters (100k) on popular object detection benchmarks.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)