Chiplet-Based RISC-V SoC with Modular AI Acceleration
- URL: http://arxiv.org/abs/2509.18355v3
- Date: Sun, 19 Oct 2025 12:05:36 GMT
- Title: Chiplet-Based RISC-V SoC with Modular AI Acceleration
- Authors: Suhas Suresh Bharadwaj, Prerana Ramkumar,
- Abstract summary: This paper presents a novel chiplet-based RISC-V system-on-a-chip (SoC) for edge AI devices.<n>The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators, 16GB3 memory stacks, and dedicated power management controllers.<n>The AI-optimized configuration achieves 14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this complex balance mainly due to low manufacturing yields (below 16%) at advanced 360 mm^2 process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system level optimization. Our proposed design integrates 4 different key innovations in a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while maintaining sub-5ms real-time capability across all experimented workloads. These performance upgrades demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability and upgradeability, crucial for next-generation edge AI device applications.
Related papers
- Evolutionary Mapping of Neural Networks to Spatial Accelerators [64.13809409887254]
We introduce the first evolutionary, hardware-in-the-loop mapping framework for neuromorphic accelerators.<n>We evaluate our approach on Intel Loihi 2, a representative spatial accelerator featuring 152 cores in a 2D mesh.<n>Our method achieves up to 35% reduction in total latency compared to default cores on two sparse multi-layer perceptron networks.
arXiv Detail & Related papers (2026-02-04T16:28:08Z) - Implementation of high-efficiency, lightweight residual spiking neural network processor based on field-programmable gate arrays [0.49806798459446283]
This work presents an efficient residual SNN accelerator that combines algorithm and hardware co-design to optimize inference energy efficiency.<n>The proposed processor achieves a classification accuracy of 87.11% on the CIFAR-10 dataset, with an inference time of 3.98 ms per image and an energy efficiency of 183.5 FPS/W.
arXiv Detail & Related papers (2025-12-09T02:08:46Z) - SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation [0.0]
We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy.<n>We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement.<n>This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par.
arXiv Detail & Related papers (2025-10-18T08:19:49Z) - VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration [0.49157446832511503]
VUSA is a systolic-array architecture that virtually grows based on the present sparsity to perform larger matrix multiplications.<n>The proposed architecture achieves saving by 37% and 68% in area and power efficiency, respectively, at the same peak-performance.
arXiv Detail & Related papers (2025-06-01T20:59:20Z) - REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption [4.713756093611972]
We present the first-of-its-kind multi-chiplet-based FHE accelerator REED' for overcoming the limitations of prior monolithic designs.<n>Results demonstrate that REED 2.5D microprocessor consumes 96.7 mm$2$ chip area, 49.4 W average power in 7nm technology.
arXiv Detail & Related papers (2023-08-05T14:04:39Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE)
MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Faster Attention Is What You Need: A Fast Self-Attention Neural Network
Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around the 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on the accuracy, if compared to its software, full precision counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Edge AI without Compromise: Efficient, Versatile and Accurate
Neurocomputing in Resistive Random-Access Memory [0.0]
We present NeuRRAM - the first multimodal edge AI chip using RRAM CIM.
We show record energy-efficiency $5times$ - $8times$ better than prior art across various computational bit-precisions.
This work paves a way towards building highly efficient and reconfigurable edge AI hardware platforms.
arXiv Detail & Related papers (2021-08-17T21:08:51Z) - FantastIC4: A Hardware-Software Co-Design Approach for Efficiently
Running 4bit-Compact Multilayer Perceptrons [19.411734658680967]
We propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine of deep neural networks (DNNs)
Our approach is centred around compression as a means for reducing the area as well as power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performances.
We show that we can achieve throughputs of 2.45 TOPS with a total power consumption of 3.6W on a Virtual Ultrascale FPGA XCVU440 device implementation, and achieve a total power efficiency of 20.17 TOPS/W on a 22nm process ASIC version.
arXiv Detail & Related papers (2020-12-17T19:10:04Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.