RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI
- URL: http://arxiv.org/abs/2511.21232v1
- Date: Wed, 26 Nov 2025 10:01:31 GMT
- Title: RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI
- Authors: Muhammed Yildirim, Ozcan Ozturk
- Abstract summary: This paper introduces a novel hardware accelerator architecture that utilizes a fused pixel-wise dataflow. It computes a single output pixel to completion across all stages (expansion, depthwise convolution, and projection) by streaming data. It achieves a speedup of up to 59.3x over the baseline software execution on the RISC-V core.
- Score: 1.1816942730023885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing demand for on-device intelligence in Edge AI and TinyML applications requires the efficient execution of modern Convolutional Neural Networks (CNNs). While lightweight architectures like MobileNetV2 employ Depthwise Separable Convolutions (DSC) to reduce computational complexity, their multi-stage design introduces a critical performance bottleneck inherent to layer-by-layer execution: the high energy and latency cost of transferring intermediate feature maps to either large on-chip buffers or off-chip DRAM. To address this memory wall, this paper introduces a novel hardware accelerator architecture that utilizes a fused pixel-wise dataflow. Implemented as a Custom Function Unit (CFU) for a RISC-V processor, our architecture eliminates the need for intermediate buffers entirely, reducing data movement by up to 87% compared to conventional layer-by-layer execution. It computes a single output pixel to completion across all DSC stages (expansion, depthwise convolution, and projection) by streaming data through a tightly-coupled pipeline without writing to memory. Evaluated on a Xilinx Artix-7 FPGA, our design achieves a speedup of up to 59.3x over the baseline software execution on the RISC-V core. Furthermore, ASIC synthesis projects a compact 0.284 mm$^2$ footprint with 910 mW power at 2 GHz in 28 nm, and a 1.20 mm$^2$ footprint with 233 mW power at 300 MHz in 40 nm. This work confirms the feasibility of a zero-buffer dataflow within a TinyML resource envelope, offering a novel and effective strategy for overcoming the memory wall in edge AI accelerators.
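To make the fused pixel-wise dataflow concrete, here is a minimal NumPy sketch of the idea (our illustration, not the authors' RTL): one output pixel is carried through expansion, depthwise convolution, and projection while only a C_mid-wide accumulator stays live, so the expanded feature map is never materialized. Shapes, names, and the ReLU activation are illustrative assumptions.

```python
import numpy as np

def fused_dsc_pixel(x_patch, w_exp, w_dw, w_proj):
    """Compute one output pixel of a DSC block (expansion -> depthwise ->
    projection) to completion, without storing the expanded feature map.

    x_patch: (K, K, C_in)   input window centered on the output pixel
    w_exp:   (C_in, C_mid)  1x1 expansion weights
    w_dw:    (K, K, C_mid)  KxK depthwise weights
    w_proj:  (C_mid, C_out) 1x1 projection weights
    """
    K = x_patch.shape[0]
    acc = np.zeros(w_dw.shape[2])             # only live intermediate state
    for i in range(K):                        # stream the window pixel by pixel
        for j in range(K):
            expanded = np.maximum(x_patch[i, j] @ w_exp, 0.0)  # expand + ReLU
            acc += expanded * w_dw[i, j]      # depthwise MAC, no buffer write
    return acc @ w_proj                       # project to the C_out output pixel
```

In layer-by-layer execution, the expanded H x W x C_mid map (the largest tensor in an inverted-residual block) must be buffered in full; in the fused scheme it exists only one pixel at a time.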
Related papers
- RPU -- A Reasoning Processing Unit [4.783828820539779]
Reasoning Processing Unit (RPU) is a chiplet-based architecture designed to address the challenges of the modern memory wall. RPU delivers up to 45.3x lower latency and 18.6x higher throughput than an H100 system at ISO-TDP on Llama3-405B.
arXiv Detail & Related papers (2026-02-20T19:13:19Z)
- Evolutionary Mapping of Neural Networks to Spatial Accelerators [64.13809409887254]
We introduce the first evolutionary, hardware-in-the-loop mapping framework for neuromorphic accelerators. We evaluate our approach on Intel Loihi 2, a representative spatial accelerator featuring 152 cores in a 2D mesh. Our method achieves up to a 35% reduction in total latency compared to the default core assignment on two sparse multi-layer perceptron networks.
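As a general illustration of hardware-in-the-loop evolutionary mapping (not this paper's specific framework), the sketch below evolves a layer-to-core assignment whose fitness is the latency measured by actually running each candidate on the device; `measure_on_hw` is a hypothetical stand-in for that measurement.

```python
import random

def evolve_mapping(n_layers, n_cores, measure_on_hw, pop=16, gens=30, mut_p=0.1):
    """Evolve a layer->core assignment; fitness = latency measured on the chip.

    measure_on_hw(mapping) is a hypothetical callable that deploys the
    candidate mapping on the device and returns its measured latency.
    """
    population = [[random.randrange(n_cores) for _ in range(n_layers)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=measure_on_hw)     # lower latency is fitter
        parents = population[: pop // 2]
        children = []
        for p in parents:                      # point-mutate each survivor
            child = [random.randrange(n_cores) if random.random() < mut_p else g
                     for g in p]
            children.append(child)
        population = parents + children
    return min(population, key=measure_on_hw)
```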
arXiv Detail & Related papers (2026-02-04T16:28:08Z) - Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling [14.471123653746275]
Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy.
arXiv Detail & Related papers (2026-01-28T15:37:50Z) - Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow [0.0]
This paper presents a low-power Vision Transformer accelerator, optimized through algorithm-hardware co-design. The model complexity is reduced using hardware-friendly dynamic token pruning without introducing complex mechanisms. We achieve a peak throughput of 1024 GOPS at 1 GHz, with an energy efficiency of 2.31 TOPS/W and an area efficiency of 858.61 GOPS/mm$^2$.
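The summary does not give the exact pruning criterion, but hardware-friendly dynamic token pruning commonly means scoring tokens with signals the accelerator already has (e.g., attention to the class token) and keeping a fixed top-k, as in this hypothetical sketch:

```python
import numpy as np

def prune_tokens(tokens, attn, keep_ratio=0.7):
    """Drop the least-attended tokens between transformer blocks.

    tokens: (N, D)    token embeddings (class token at index 0)
    attn:   (H, N, N) attention maps from the previous block
    """
    # Importance = attention the class token pays to each token, head-averaged.
    score = attn[:, 0, :].mean(axis=0)        # (N,)
    score[0] = np.inf                         # always keep the class token
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = np.argsort(score)[-k:]             # indices of the top-k tokens
    return tokens[np.sort(keep)]              # preserve original token order
```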
arXiv Detail & Related papers (2025-10-16T07:44:42Z)
- Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges.
We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs.
This work advances AI-driven signal restoration technology and paves the way for efficient, robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z)
- Spiker+: a framework for the generation of efficient Spiking Neural Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, and low-area customized Spiking Neural Networks (SNN) accelerators on FPGA for inference at the edge.
Spiker+ is tested on two benchmark datasets: MNIST and the Spiking Heidelberg Digits (SHD).
arXiv Detail & Related papers (2024-01-02T10:42:42Z)
- FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition on The Edge [0.6254873489691849]
This paper proposes an integrated, scalable, end-to-end HW/SW accelerator co-design for HAR, based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN architecture.
Our design uses a partially streaming dataflow architecture to achieve higher throughput at a favorable network-design and resource-utilization trade-off.
Our proposed methodology achieves nearly 81% prediction accuracy with approximately 24 FPS real-time inference throughput at 187 MHz on a ZCU104.
arXiv Detail & Related papers (2023-11-04T10:38:21Z)
- MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory [76.02294791513552]
We propose a hardware/algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory.
Experimental results demonstrate that MCUFormer achieves 73.62% top-1 accuracy on ImageNet image classification with 320 KB of memory.
arXiv Detail & Related papers (2023-10-25T18:00:26Z)
- ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with Decoupled Asymmetric Convolution [0.0502254944841629]
Deep learning-driven super-resolution (SR) outperforms traditional techniques but faces the challenges of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU enhances image quality by 0.34 dB with a 27-layer model, while requiring 36% less complexity than FSRCNN.
arXiv Detail & Related papers (2023-08-30T07:23:32Z)
- RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge [1.8293684411977293]
Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit sparsity to reduce area (storage), power, and latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z)
- PSCNN: A 885.86 TOPS/W Programmable SRAM-based Computing-In-Memory Processor for Keyword Spotting [0.10547353841674209]
This paper proposes a programmable CIM processor with a single large-sized CIM macro instead of multiple smaller ones for power-efficient computation.
The proposed architecture adopts a pooling write-back method to support fused or independent convolution/pooling operations, reducing latency by 35.9% (see the sketch below).
The design, fabricated in TSMC 28 nm technology, achieves 150.8 GOPS throughput and 885.86 TOPS/W power efficiency at 10 MHz when executing our binary keyword spotting model.
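A minimal sketch of the general fused convolution/pooling idea (our illustration; the fabricated design's write-back logic is not described in the summary): consume convolution output rows as they are produced and write back only the pooled result, so the full-resolution map never reaches memory.

```python
import numpy as np

def conv_with_fused_maxpool(conv_out_rows):
    """Consume convolution output two rows at a time and write back only
    the 2x2 max-pooled rows, never storing the full-resolution map.

    conv_out_rows: iterable yielding conv output rows of shape (W, C), W even.
    """
    rows = iter(conv_out_rows)
    pooled = []
    for even_row, odd_row in zip(rows, rows):           # consecutive row pairs
        pair = np.maximum(even_row, odd_row)            # vertical 2:1 max, (W, C)
        pooled.append(np.maximum(pair[0::2], pair[1::2]))  # horizontal 2:1 max
    return np.stack(pooled)                             # (H/2, W/2, C)
```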
arXiv Detail & Related papers (2022-05-02T09:58:18Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
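A back-of-envelope model shows why patch-based scheduling cuts peak memory: early CNN stages have large spatial extent and few channels, so holding full input and output maps dominates the footprint. The numbers below are illustrative, not from the paper (halo overlap between patches is ignored for brevity).

```python
def peak_activation_bytes(h, w, c_in, c_out, patch=None, bpe=1):
    """Peak activation memory of one conv layer, int8 (1 byte/element) by default.
    Whole-layer execution holds the full input and output maps; patch-based
    execution holds only one patch of each at a time.
    """
    if patch is None:                                   # whole-layer execution
        return (h * w * c_in + h * w * c_out) * bpe
    return patch * patch * (c_in + c_out) * bpe         # one patch of each

# Illustrative early MobileNet-style stage at 224x224, 16 channels in/out:
full = peak_activation_bytes(224, 224, 16, 16)             # ~1.6 MB
tiled = peak_activation_bytes(224, 224, 16, 16, patch=56)  # ~100 KB
print(f"layer-by-layer: {full / 1024:.0f} KiB, patch-based: {tiled / 1024:.0f} KiB")
```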
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs [6.403349961091506]
Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads.
DORY is an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip memory.
arXiv Detail & Related papers (2020-08-17T07:30:54Z)