CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications
- URL: http://arxiv.org/abs/2602.19268v1
- Date: Sun, 22 Feb 2026 16:51:17 GMT
- Title: CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications
- Authors: Sonu Kumar, Mohd Faisal Khan, Mukul Lokhande, Santosh Kumar Vishvakarma
- Abstract summary: This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. ASIC implementation results show that each MAC stage can save up to 33% of time and 21% of power, with a 256-PE configuration.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. Its resource-efficient approach further enables up to 4x throughput improvement within the same hardware resources by leveraging vectorised, time-multiplexed execution and flexible precision scaling. With a time-multiplexed multi-AF block and a lightweight pooling and normalisation unit, the proposed vector engine supports flexible precision (4/8/16-bit) and high MAC density. ASIC implementation results show that each MAC stage can save up to 33% of time and 21% of power, with a 256-PE configuration that achieves higher compute density (4.83 TOPS/mm²) and energy efficiency (11.67 TOPS/W) than previous state-of-the-art work. A detailed hardware-software co-design methodology for object detection and classification tasks on Pynq-Z2 is presented to evaluate the proposed architecture, demonstrating a scalable, energy-efficient solution for edge AI applications.
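As a rough illustration of the mechanism behind the approximate/accurate mode switch, the sketch below implements a linear-mode CORDIC multiply-accumulate in Python: the product is built purely from shift-and-add iterations, so truncating the iteration count trades accuracy for latency. The function name, float arithmetic, and iteration schedule are illustrative assumptions, not the paper's exact fixed-point datapath.

```python
def cordic_mac(acc, x, z, iters=16):
    """Linear-mode CORDIC multiply-accumulate: returns acc + x*z
    using only shift-and-add style updates (no multiplier).
    Fewer iterations -> faster but coarser, mirroring the engine's
    approximate mode. Convergence requires |z| < 2."""
    y = acc
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0
        y += d * x * 2.0 ** -i   # in hardware: x shifted right by i
        z -= d * 2.0 ** -i       # drive the residual multiplier to 0
    return y

# cordic_mac(0.0, 0.75, 1.5, iters=8) is already close to 1.125;
# iters=16 matches it to roughly 2**-15.
```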
Related papers
- RAMAN: Resource-efficient ApproxiMate Posit Processing for Algorithm-Hardware Co-desigN [0.0]
This work presents RAMAN, a resource-efficient and approximate posit(8,2)-based Multiply-Accumulate (MAC) architecture. The proposed REAP MAC engine, which is at the core of RAMAN, uses approximation in the posit multiplier to achieve significant area and power reductions.
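For readers unfamiliar with the number format, a minimal Python decoder for posit(8,2) values is sketched below (sign, regime, up to two exponent bits, fraction); it illustrates the format RAMAN's MAC operates on, not the REAP engine itself.

```python
def decode_posit8_es2(bits: int) -> float:
    """Decode one posit(8,2) byte into a float. Illustrative only."""
    if bits == 0:
        return 0.0
    if bits == 0x80:
        return float("nan")       # NaR (Not a Real)
    sign = -1.0 if bits & 0x80 else 1.0
    if sign < 0:
        bits = (-bits) & 0xFF     # negatives are two's complement
    rest = bits & 0x7F            # 7 bits: regime, exponent, fraction
    first = (rest >> 6) & 1
    run = 0
    for i in range(6, -1, -1):    # regime = run of identical bits
        if (rest >> i) & 1 == first:
            run += 1
        else:
            break
    k = run - 1 if first else -run
    rem = 7 - run - (1 if run < 7 else 0)  # bits after regime + stop bit
    eb = min(2, rem)                       # es = 2 exponent bits max
    e = ((rest >> (rem - eb)) & ((1 << eb) - 1)) << (2 - eb) if eb else 0
    fb = rem - eb
    f = (rest & ((1 << fb) - 1)) / (1 << fb) if fb else 0.0
    return sign * 2.0 ** (4 * k + e) * (1.0 + f)

assert decode_posit8_es2(0x40) == 1.0 and decode_posit8_es2(0x48) == 2.0
```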
arXiv Detail & Related papers (2025-10-26T11:09:06Z)
- ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization [99.96330641363396]
ARMOR (Adaptive Representation with Matrix-factORization) is a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. We demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations.
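The 2:4 pattern itself is easy to picture: in each group of four consecutive weights, exactly two survive. Below is a NumPy sketch of the standard magnitude-based projection onto that pattern; ARMOR's contribution is precisely that it optimizes a factorized form rather than applying this direct projection.

```python
import numpy as np

def project_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude entries in every group of four
    along each row, yielding a 2:4 semi-structured sparse matrix."""
    rows, cols = w.shape
    assert cols % 4 == 0, "row length must be a multiple of 4"
    g = w.reshape(rows, cols // 4, 4).astype(float)
    drop = np.argsort(np.abs(g), axis=-1)[..., :2]  # 2 smallest per group
    mask = np.ones_like(g)
    np.put_along_axis(mask, drop, 0.0, axis=-1)
    return (g * mask).reshape(rows, cols)
```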
arXiv Detail & Related papers (2025-10-07T02:39:20Z)
- POLARON: Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration [0.0]
This work presents a SIMD-enabled, multi-precision MAC engine that performs efficient multiply-accumulate operations. The architecture incorporates a layer-adaptive precision strategy to align computational accuracy with workload sensitivity. Results demonstrate up to 2x improvement in power-delay product (PDP) and 3x reduction in resource usage compared to SoTA designs.
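A toy version of a layer-adaptive precision policy is sketched below: each layer's weights are fake-quantized at candidate widths and the narrowest width meeting an error budget wins. The MSE threshold and the candidate set are assumptions for illustration; POLARON ties the choice to workload sensitivity rather than a fixed budget.

```python
import numpy as np

def fake_quant(x, bits):
    """Symmetric uniform quantize-dequantize to `bits`."""
    s = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / s) * s if s > 0 else x

def pick_precision(w, budget=1e-4, widths=(4, 8, 16)):
    """Return the narrowest width whose quantization MSE fits the budget."""
    for b in widths:
        if np.mean((w - fake_quant(w, b)) ** 2) <= budget:
            return b
    return widths[-1]

# Layers with near-uniform small weights usually pass at 4 or 8 bits;
# heavy-tailed layers fall back to 16.
```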
arXiv Detail & Related papers (2025-06-10T13:33:02Z)
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize monocular depth estimation (MDE) models with hardware acceleration for ASICs. Our approach quantizes both weights and activations to 4-bit precision, reducing model size and computation cost. We design a flexible and programmable hardware accelerator that supports kernel fusion and customized instruction programmability.
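The core numeric trick (4-bit weights and activations with a single output rescale) can be sketched in a few NumPy lines; per-tensor symmetric scaling is assumed here, whereas QuartDepth's actual calibration and kernel fusion are more involved.

```python
import numpy as np

def quant_int4(x):
    """Per-tensor symmetric 4-bit quantization: codes in [-7, 7] + scale."""
    s = max(np.max(np.abs(x)) / 7.0, 1e-12)
    return np.clip(np.round(x / s), -7, 7).astype(np.int8), s

def w4a4_matmul(a, w):
    """Integer matmul on 4-bit codes, rescaled once at the output."""
    qa, sa = quant_int4(a)
    qw, sw = quant_int4(w)
    return qa.astype(np.int32) @ qw.astype(np.int32) * (sa * sw)
```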
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
- CORDIC Is All You Need [0.18184027690235535]
We present a pipelined architecture with CORDIC blocks for linear MAC computations and nonlinear iterative activation functions. The approach centres on a Reconfigurable Processing Engine (RPE)-based systolic array. The FPGA implementation achieves up to 2.5× resource savings and 3× power reduction compared to prior works.
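To make the "nonlinear iterative activation functions" concrete, here is a hedged Python sketch of hyperbolic rotation-mode CORDIC computing tanh; iterations 4 and 13 are repeated, as hyperbolic CORDIC requires for convergence. The atanh lookup and float arithmetic stand in for the fixed-point angle tables a hardware block would use.

```python
import math

def cordic_tanh(z, iters=16):
    """tanh(z) for |z| <= ~1.11 via hyperbolic rotation-mode CORDIC:
    rotate (x, y) until the angle accumulator hits z, then tanh = y/x
    (the CORDIC gain cancels in the ratio)."""
    x, y = 1.0, 0.0
    i, done, repeat = 1, 0, {4, 13}
    while done < iters:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)
        done += 1
        if i in repeat:
            repeat.discard(i)    # run iterations 4 and 13 twice
        else:
            i += 1
    return y / x

assert abs(cordic_tanh(0.5) - math.tanh(0.5)) < 1e-3
```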
arXiv Detail & Related papers (2025-03-04T12:23:27Z)
- Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads [0.0]
This work proposes a flexible and SIMD multi-precision processing element (FlexPE). The proposed design achieves improved throughput of up to 16× FxP4, 8× FxP8, 4× FxP16, and 1× FxP32 in pipeline mode with 100% time-multiplexed hardware.
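The 16×/8×/4×/1× scaling follows directly from sub-word packing: a fixed-width datapath holds sixteen FxP4 lanes, eight FxP8 lanes, and so on. A minimal Python sketch of that lane layout follows (low lane first, two's-complement lanes); it shows the layout only, not FlexPE's arithmetic.

```python
def pack_lanes(vals, bits):
    """Pack signed lanes into one integer word, lowest lane first."""
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(vals):
        word |= (v & mask) << (i * bits)   # two's-complement lane
    return word

def unpack_lanes(word, bits, n):
    """Recover n signed lanes from a packed word."""
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    return [((word >> (i * bits)) & mask) - ((mask + 1)
            if (word >> (i * bits)) & sign else 0) for i in range(n)]

assert unpack_lanes(pack_lanes([3, -2, 7, -8], 4), 4, 4) == [3, -2, 7, -8]
```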
arXiv Detail & Related papers (2024-12-16T12:25:57Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit precision.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
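A condensed sketch of the hybrid idea follows: a handful of outlier columns stay in full precision while the rest are quantized to 4-bit codes with per-column scales. Selecting outliers by column max-magnitude is an assumption for illustration, not QUIK's exact criterion.

```python
import numpy as np

def hybrid_w4_split(w, n_outlier=8):
    """Split w into FP outlier columns and per-column 4-bit codes."""
    out_cols = np.argsort(np.max(np.abs(w), axis=0))[-n_outlier:]
    dense = np.setdiff1d(np.arange(w.shape[1]), out_cols)
    s = np.maximum(np.max(np.abs(w[:, dense]), axis=0) / 7.0, 1e-12)
    q = np.clip(np.round(w[:, dense] / s), -7, 7).astype(np.int8)
    return q, s, dense, w[:, out_cols], out_cols

# At inference, the int4 block runs through integer kernels while the
# few outlier columns use a small full-precision matmul.
```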
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves these challenges by introducing the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
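The sparsity MoE provides comes from routing: a small learned router scores all experts per token, but only the top-k run. A generic top-2 router sketch in NumPy is shown below (standard MoE gating, not Edge-MoE's specific scheduler).

```python
import numpy as np

def route_top2(x, router_w):
    """Score experts with a linear router and keep the top 2 per token.
    x: [tokens, d], router_w: [d, experts]."""
    logits = x @ router_w
    top2 = np.argsort(logits, axis=-1)[:, -2:]          # expert ids
    picked = np.take_along_axis(logits, top2, axis=-1)
    gates = np.exp(picked - picked.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)               # softmax of 2
    return top2, gates   # only these experts execute per token
```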
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
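A butterfly matrix is a product of log2(n) sparse stages, each mixing element pairs at a growing stride (the FFT access pattern), so a dense n×n product drops to O(n log n) multiplies. A minimal NumPy sketch of applying such stages is given below; the per-stage 2x2 block parameterization is an assumption for illustration.

```python
import numpy as np

def butterfly_apply(x, stages):
    """Apply log2(n) butterfly stages to a length-n vector.
    stages[s] has shape (n//2, 2, 2): one 2x2 block per element pair
    at stride 2**s. Total cost is O(n log n) vs O(n^2) dense."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    for s, blocks in enumerate(stages):
        stride, b = 2 ** s, 0
        for start in range(0, n, 2 * stride):
            for j in range(start, start + stride):
                a, c = x[j], x[j + stride]
                m = blocks[b]
                x[j], x[j + stride] = (m[0, 0] * a + m[0, 1] * c,
                                       m[1, 0] * a + m[1, 1] * c)
                b += 1
    return x
```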
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4bit-Compact Multilayer Perceptrons [19.411734658680967]
We propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine of deep neural networks (DNNs).
Our approach is centred around compression as a means for reducing the area and power requirements of multilayer perceptrons (MLPs) with high predictive performance.
We show that we can achieve throughputs of 2.45 TOPS with a total power consumption of 3.6W on a Virtex UltraScale FPGA XCVU440 device implementation, and a total power efficiency of 20.17 TOPS/W on a 22nm process ASIC version.
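Storage-wise, 4-bit compactness means two weights per byte. The NumPy sketch below shows that packing and its inverse (low nibble first is an assumed layout); in a design like FantastIC4 the matching unpack sits next to the compute units.

```python
import numpy as np

def pack_w4(q):
    """Pack int4 codes in [-8, 7] two per byte, low nibble first."""
    q = np.asarray(q, dtype=np.int8).ravel()
    assert q.size % 2 == 0
    lo = (q[0::2] & 0xF).astype(np.uint8)
    hi = (q[1::2] & 0xF).astype(np.uint8)
    return lo | (hi << 4)

def unpack_w4(packed):
    """Inverse of pack_w4: recover signed int4 values."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    lo[lo > 7] -= 16
    hi[hi > 7] -= 16
    return np.stack([lo, hi], axis=1).ravel()
```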
arXiv Detail & Related papers (2020-12-17T19:10:04Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)