Related papers: Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

URL: http://arxiv.org/abs/2407.14274v1
Date: Fri, 19 Jul 2024 12:54:04 GMT
Title: Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations
Authors: Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, Dimitrios Soudris
Abstract summary: Modern embedded microprocessors provide very limited support for mixed-precision NNs. We introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference. Our framework can achieve, on average, 15x energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.
Score: 5.847997723738113
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advancements in quantization and mixed-precision approaches offer substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low precision can attain accuracies comparable to full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision NNs regarding both Instruction Set Architecture (ISA) extensions and their hardware design for efficient execution of mixed-precision operations, introducing several performance bottlenecks due to numerous instructions for data packing and unpacking, arithmetic unit under-utilization, etc. In this work, we bring together, for the first time, ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient DNN inference on leading RISC-V CPU architectures. To this end, we introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference in cycle-accurate emulations. At the hardware level, we first expand the ALU within our proof-of-concept micro-architecture to support configurable fine-grained mixed-precision arithmetic operations. Subsequently, we implement multi-pumping to minimize execution latency, with an additional soft SIMD optimization applied for 2-bit operations. At the ISA level, three distinct MAC instructions are encoded extending the RISC-V ISA, and exposed up to the compiler level, each corresponding to a different mixed-precision operational mode. Our extensive experimental evaluation over widely used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our framework can achieve, on average, 15x energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.
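The abstract mentions a soft SIMD optimization for 2-bit operations but does not describe the mechanism. As a rough illustration of the general idea only (not the authors' hardware design or their MAC instruction encoding), the sketch below packs several 2-bit weights into one integer and obtains all products with a single wide multiplication. The function names, the unsigned operand ranges, and the 10-bit lane width are assumptions made for this example.

```python
# Illustrative soft-SIMD (SIMD-within-a-register) multiply.
# Assumptions for this sketch: unsigned 8-bit activations, unsigned 2-bit
# weights, and a hypothetical lane width of 10 bits. The largest product is
# 255 * 3 = 765 < 2**10, so products in neighbouring lanes never overlap.

LANE_BITS = 10  # hypothetical lane width, chosen so 765 fits in one lane


def pack_weights(weights, lane_bits=LANE_BITS):
    """Pack unsigned 2-bit weights into one integer, one weight per lane."""
    word = 0
    for lane, w in enumerate(weights):
        if not 0 <= w < 4:
            raise ValueError("weights must fit in 2 unsigned bits")
        word |= w << (lane * lane_bits)
    return word


def soft_simd_mul(activation, packed, n_lanes, lane_bits=LANE_BITS):
    """Multiply one 8-bit activation by every packed weight in one multiply."""
    if not 0 <= activation < 256:
        raise ValueError("activation must fit in 8 unsigned bits")
    wide = activation * packed  # a single multiplication covers all lanes
    mask = (1 << lane_bits) - 1
    # Unpack: each lane of the wide result holds activation * weight.
    return [(wide >> (lane * lane_bits)) & mask for lane in range(n_lanes)]


if __name__ == "__main__":
    packed = pack_weights([3, 0, 2, 1])
    print(soft_simd_mul(200, packed, n_lanes=4))  # [600, 0, 400, 200]
```

The lane width is the key design choice in a scheme like this: it must be wide enough to hold the largest possible product, otherwise results spill into the neighbouring lane. Signed operands would need extra correction steps that this sketch leaves out.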

Related papers

The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference [0.9954176833299684]
Deep learning (DL) has led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats. This paper revisits traditional high-performance GEMM and describes strategies for adapting it to mixed-precision integer arithmetic.
arXiv Detail & Related papers (2025-06-13T12:40:16Z)
POLARON: Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration [0.0]
This work presents a SIMD-enabled, multi-precision MAC engine that performs efficient multiply-accumulate operations. The architecture incorporates a layer adaptive precision strategy to align computational accuracy with workload sensitivity. Results demonstrate up to 2x improvement in PDP and 3x reduction in resource usage compared to SoTA designs.
arXiv Detail & Related papers (2025-06-10T13:33:02Z)
Design and Implementation of a RISC-V SoC with Custom DSP Accelerators for Edge Computing [0.0]
We examine the RV32I base instruction set with extensions for multiplication (M) and atomic operations (A). Our results demonstrate RISC-V's advantages in embedded systems and its scalability for custom accelerators.
arXiv Detail & Related papers (2025-06-07T07:17:40Z)
DOMAC: Differentiable Optimization for High-Speed Multipliers and Multiply-Accumulators [25.876084896293058]
DOMAC is a novel approach that employs differentiable optimization for designing multipliers and MACs at specific technology nodes. Building on this insight, DOMAC reformulates the discrete optimization challenge into a continuous problem by incorporating differentiable timing and area objectives.
arXiv Detail & Related papers (2025-03-31T10:49:05Z)
MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization [6.456189487006878]
Quantization of foundational models (FMs) is challenging due to the emergence of large magnitude features called outliers. Existing outlier-aware algorithm/architecture co-design techniques either use mixed-precision, retaining outliers at high precision but compromise hardware efficiency, or quantize inliers and outliers at the same precision. We propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization.
arXiv Detail & Related papers (2024-11-08T02:25:45Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
MCU-MixQ: A HW/SW Co-optimized Mixed-precision Neural Network Design Framework for MCUs [9.719789698194154]
A mixed-precision neural network (MPNN) that utilizes just enough data width for neural network processing is an effective approach to meeting stringent resource constraints. However, there is still a lack of sub-byte and mixed-precision SIMD operations in MCU-class ISAs. In this work, we propose to pack multiple low-bitwidth arithmetic operations within a single instruction multiple data (SIMD) instruction in typical MCUs.
arXiv Detail & Related papers (2024-07-17T14:51:15Z)
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices. For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core based on PRogrammable Multi-Operand Multimode Interference [18.0155410476884]
Photonic tensor core (PTC) designs based on standard optical components hinder scalability and compute density due to their large spatial footprint. We propose an ultra-compact PTC using customized programmable multi-operand multimode interference (MOMMI) devices, named M3ICRO. M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x higher compute density, 3.7-12x higher system throughput, and superior noise robustness.
arXiv Detail & Related papers (2023-05-31T02:34:36Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such a computation can well adapt to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g. visual systems, robotics, etc. Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs. In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations.
arXiv Detail & Related papers (2021-08-21T09:50:54Z)
Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples. We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
Measuring what Really Matters: Optimizing Neural Networks for TinyML [7.455546102930911]
Neural networks (NNs) have experienced an unprecedented growth in architectural and computational complexity. Introducing NNs to resource-constrained devices enables cost-efficient deployments, widespread availability, and the preservation of sensitive data. This work addresses the challenges of bringing Machine Learning to MCUs, where we focus on the ubiquitous ARM Cortex-M architecture.
arXiv Detail & Related papers (2021-04-21T17:14:06Z)
Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy. We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR. Our FQSR using low bits quantization can achieve on par performance compared with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits [99.59941892183454]
We propose Einsum Networks (EiNets), a novel implementation design for PCs. At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation. We show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation.
arXiv Detail & Related papers (2020-04-13T23:09:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.