Related papers: Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

URL: http://arxiv.org/abs/2507.01457v1
Date: Wed, 02 Jul 2025 08:15:33 GMT
Title: Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs
Authors: Federico Nicolas Peccia, Frederik Haxel, Oliver Bringmann
Abstract summary: We present a workflow based on the TVM compiler to efficiently map AI workloads onto RISC-V vector units. Our proposal shows a mean improvement of 46% in execution latency when compared against the autovectorization feature of GCC. We open-sourced our proposal for the community to expand it to target other RISC-V extensions.
Score: 0.6242215470795112
License: http://creativecommons.org/licenses/by/4.0/
Abstract: RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI workloads. But writing software that efficiently utilizes the vector units of RISC-V CPUs without expert knowledge requires the programmer to rely on the autovectorization features of compilers or hand-crafted libraries like muRISCV-NN. Smarter approaches, like autotuning frameworks, have been missing the integration with the RISC-V RVV extension, thus heavily limiting the efficient deployment of complex AI workloads. In this paper, we present a workflow based on the TVM compiler to efficiently map AI workloads onto RISC-V vector units. Instead of relying on hand-crafted libraries, we integrated the RVV extension into TVM's MetaSchedule framework, a probabilistic program framework for tensor operation tuning. We implemented different RISC-V SoCs on an FPGA and tuned a wide range of AI workloads on them. We found that our proposal shows a mean improvement of 46% in execution latency when compared against the autovectorization feature of GCC, and 29% against muRISCV-NN. Moreover, the binary resulting from our proposal has a smaller code memory footprint, making it more suitable for embedded devices. Finally, we also evaluated our solution on a commercially available RISC-V SoC implementing the RVV 1.0 Vector Extension and found our solution is able to find mappings that are 35% faster on average than the ones proposed by LLVM. We open-sourced our proposal for the community to expand it to target other RISC-V extensions.

Related papers

Design and Implementation of a RISC-V SoC with Custom DSP Accelerators for Edge Computing [0.0]
We examine the RV32I base instruction set with extensions for multiplication (M) and atomic operations (A). Our results demonstrate RISC-V's advantages in embedded systems and its scalability for custom accelerators.
arXiv Detail & Related papers (2025-06-07T07:17:40Z)
Hardware/Software Co-Design of RISC-V Extensions for Accelerating Sparse DNNs on FPGAs [1.4225653519332482]
We propose novel RISC-V extensions for accelerating DNN models containing semi-structured and unstructured sparsity. Our designs consume a small amount of additional FPGA resources such that the resulting co-designs enable the acceleration of DNNs even on small FPGAs. We benchmark our designs on standard TinyML applications such as keyword spotting, image classification, and person detection.
arXiv Detail & Related papers (2025-04-28T10:19:39Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs). We show that our system and method can achieve 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
RISC-V RVV efficiency for ANN algorithms [0.5892638927736115]
This study examines the effectiveness of applying RVV to commonly used ANN algorithms. The algorithms were adapted for RISC-V and optimized using RVV after identifying the primary bottlenecks.
arXiv Detail & Related papers (2024-07-18T09:26:07Z)
RISC-V R-Extension: Advancing Efficiency with Rented-Pipeline for Edge DNN Processing [0.8192907805418583]
This paper introduces the RISC-V R-extension, a novel approach to enhancing deep neural network (DNN) process efficiency on edge devices. The extension features rented-pipeline stages and architectural pipeline registers (APR), which optimize critical operation execution, thereby reducing latency and memory access frequency.
arXiv Detail & Related papers (2024-07-02T19:25:05Z)
Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA [20.629635991749808]
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs. At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads. At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problems. We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
Joint User Association, Interference Cancellation and Power Control for Multi-IRS Assisted UAV Communications [80.35959154762381]
Intelligent reflecting surface (IRS)-assisted unmanned aerial vehicle (UAV) communications are expected to alleviate the load of ground base stations in a cost-effective way. Existing studies mainly focus on the deployment and resource allocation of a single IRS instead of multiple IRSs. We propose a new optimization algorithm for joint IRS-user association, trajectory optimization of UAVs, successive interference cancellation (SIC) decoding order scheduling and power allocation.
arXiv Detail & Related papers (2023-12-08T01:57:10Z)
Improved vectorization of OpenCV algorithms for RISC-V CPUs [0.0]
We discuss the possibilities of accelerating computations on available RISC-V processors. It is shown that improved vectorization speeds up computations on existing prototypes of RISC-V devices by tens of percent.
arXiv Detail & Related papers (2023-09-19T12:36:03Z)
Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications. The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs). Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications. We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS). Specifically, we minimize the learning error of all participating users by jointly optimizing transmit power of mobile users, beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.