CORDIC Is All You Need
- URL: http://arxiv.org/abs/2503.11685v1
- Date: Tue, 04 Mar 2025 12:23:27 GMT
- Title: CORDIC Is All You Need
- Authors: Omkar Kokane, Adam Teman, Anushka Jha, Guru Prasath SL, Gopal Raut, Mukul Lokhande, S. V. Jaya Chand, Tanushree Dewangan, Santosh Kumar Vishvakarma
- Abstract summary: We present a pipelined architecture with a CORDIC block for linear MAC computations and nonlinear iterative Activation Functions. The approach centers on a Reconfigurable Processing Engine (RPE) based systolic array. The FPGA implementation achieves up to 2.5 $\times$ resource savings and 3 $\times$ power reduction compared to prior works.
- Score: 0.18184027690235535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence necessitates adaptable hardware accelerators that deliver efficient, high-throughput operation at the scale of millions of operations. We present a pipelined architecture with a CORDIC block for linear MAC computations and nonlinear iterative Activation Functions (AF) such as $tanh$, $sigmoid$, and $softmax$. The approach centers on a Reconfigurable Processing Engine (RPE) based systolic array with a 40% pruning rate, throughput enhanced up to 4.64$\times$, and power and area reduced by 5.02 $\times$ and 4.06 $\times$ at 28 nm CMOS, with minor accuracy loss. The FPGA implementation achieves up to 2.5 $\times$ resource savings and 3 $\times$ power reduction compared to prior works. The Systolic CORDIC engine for Reconfigurability and Enhanced throughput (SYCore) deploys an output-stationary dataflow with the CAESAR control engine for diverse AI workloads such as Transformers, RNNs/LSTMs, and DNNs, targeting applications like image detection, LLMs, and speech recognition. This energy-efficient and flexible approach extends to edge AI accelerators supporting emerging workloads.
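The core idea is that one shift-and-add CORDIC datapath can serve both the linear MAC and the nonlinear AFs. As an illustration of the nonlinear side only, here is a minimal software sketch of rotation-mode hyperbolic CORDIC evaluating tanh and sigmoid; the iteration count, floating-point arithmetic, and function names are assumptions made for readability, not the paper's fixed-point RPE design (hardware would replace the multiplications by 2^{-s} with wired shifts and read the atanh constants from a small LUT).

```python
import math

def cordic_tanh(z0, iterations=16):
    """Approximate tanh(z0) via rotation-mode hyperbolic CORDIC.

    Valid for roughly |z0| <= 1.1; wider ranges need argument reduction,
    which this sketch omits.
    """
    # Shift schedule: hyperbolic CORDIC starts at i = 1 and repeats
    # iterations 4, 13, 40, ... to guarantee convergence.
    shifts, i = [], 1
    while len(shifts) < iterations:
        shifts.append(i)
        if i in (4, 13, 40) and len(shifts) < iterations:
            shifts.append(i)
        i += 1

    x, y, z = 1.0, 0.0, z0
    for s in shifts:
        d = 1.0 if z >= 0.0 else -1.0
        x, y, z = (x + d * y * 2.0 ** -s,
                   y + d * x * 2.0 ** -s,
                   z - d * math.atanh(2.0 ** -s))
    # (x, y) converge to (A*cosh(z0), A*sinh(z0)); the gain A cancels in the ratio.
    return y / x

def cordic_sigmoid(z0, iterations=16):
    """sigmoid(z) = 0.5 * (1 + tanh(z / 2)), reusing the same CORDIC core."""
    return 0.5 * (1.0 + cordic_tanh(z0 / 2.0, iterations))

if __name__ == "__main__":
    for v in (-1.0, -0.25, 0.0, 0.5, 1.0):
        print(v, cordic_tanh(v), math.tanh(v))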
Related papers
- MARCA: Mamba Accelerator with ReConfigurable Architecture [16.48279181435065]
We propose a Mamba accelerator with reconfigurable architecture, MARCA.
Its contributions include a reduction-alternative PE array architecture for both linear and element-wise operations,
a reusable nonlinear function unit based on the reconfigurable PE,
and an intra-operation and inter-operation buffer management strategy.
arXiv Detail & Related papers (2024-09-16T15:18:33Z) - Stochastic Spiking Attention: Accelerating Attention with Stochastic Computing in Spiking Networks [33.51445486269896]
Spiking Neural Networks (SNNs) have been recently integrated into Transformer architectures due to their potential to reduce computational demands and to improve power efficiency.
We propose a novel framework leveraging stochastic computing (SC) to effectively execute the dot-product attention for SNN-based Transformers.
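To make the stochastic-computing idea concrete, the sketch below shows a generic unipolar SC dot product, where each operand becomes a Bernoulli bitstream and multiplication reduces to a bitwise AND; the helper names (`to_stream`, `sc_dot`), the stream length, and the [0, 1] normalisation are illustrative assumptions, not the paper's SNN-specific design.

```python
import random

def to_stream(p, length, rng):
    """Unipolar SC encoding: value p in [0, 1] -> Bernoulli bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_dot(a, b, length=1024, seed=0):
    """Estimate sum_i a_i * b_i with bitwise-AND stochastic multiplication.

    a, b: sequences of values in [0, 1] (e.g. normalised query/key entries).
    """
    rng = random.Random(seed)
    total = 0.0
    for ai, bi in zip(a, b):
        sa = to_stream(ai, length, rng)
        sb = to_stream(bi, length, rng)
        total += sum(x & y for x, y in zip(sa, sb)) / length
    return total

if __name__ == "__main__":
    q = [0.2, 0.8, 0.5]
    k = [0.6, 0.4, 0.9]
    print(sc_dot(q, k), sum(x * y for x, y in zip(q, k)))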
arXiv Detail & Related papers (2024-02-14T11:47:19Z) - A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE [0.8403582577557918]
The Transformer has been adopted for image recognition tasks and shown to outperform CNNs and RNNs, but it suffers from high training cost and computational complexity.
We propose a lightweight hybrid model which uses Neural ODE as a backbone instead of ResNet.
The proposed model is deployed on a modest-sized FPGA device for edge computing.
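As a rough illustration of why a Neural ODE backbone suits a modest FPGA, the sketch below reuses one set of weights across several Euler integration steps instead of stacking distinct residual blocks; the block definition (`ode_block`), step count, and activation are assumptions, not the paper's hybrid model.

```python
import numpy as np

def ode_block(x, W, b, steps=4, h=0.25):
    """Euler-integrated residual block: one weight set reused for all steps.

    Replacing N distinct ResNet blocks with N reuses of a single f(x) cuts the
    parameter count roughly by N, which is what makes the backbone FPGA-friendly.
    """
    for _ in range(steps):
        x = x + h * np.tanh(x @ W + b)   # f(x) = tanh(Wx + b), reused each step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 8)) * 0.1
    b = np.zeros(8)
    x = rng.standard_normal((1, 8))
    print(ode_block(x, W, b).shape)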
arXiv Detail & Related papers (2024-01-05T09:32:39Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and an over 80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT through a collection of architectural innovations.
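For context, here is a minimal sketch of top-k mixture-of-experts routing: only the selected experts run per token, which is where the computation saving comes from and also why the dataflow is irregular on an FPGA. The router, expert definition, and k are illustrative assumptions, not Edge-MoE's architecture.

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=1):
    """Route each token to its top-k experts and mix the results.

    x: (tokens, dim); gate_W: (dim, n_experts); experts: list of (W, b) pairs.
    Only the selected experts run per token -- the source of the compute saving
    and of the dynamic, data-dependent dataflow.
    """
    scores = x @ gate_W                                  # (tokens, n_experts)
    top = np.argsort(scores, axis=1)[:, -k:]             # chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = scores[t, top[t]]
        w = np.exp(sel - sel.max()); w /= w.sum()        # softmax over top-k
        for weight, e in zip(w, top[t]):
            W, b = experts[e]
            out[t] += weight * np.maximum(x[t] @ W + b, 0.0)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_exp = 8, 4
    experts = [(rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim))
               for _ in range(n_exp)]
    x = rng.standard_normal((3, dim))
    print(moe_forward(x, rng.standard_normal((dim, n_exp)), experts).shape)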
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Accurate, Low-latency, Efficient SAR Automatic Target Recognition on FPGA [3.251765107970636]
Synthetic aperture radar (SAR) automatic target recognition (ATR) is the key technique for remote-sensing image recognition.
The state-of-the-art convolutional neural networks (CNNs) for SAR ATR suffer from high computation cost and large memory footprint.
We propose a comprehensive GNN-based model-architecture co-design on FPGA to address the above issues.
arXiv Detail & Related papers (2023-01-04T05:35:30Z) - RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems [68.8204255655161]
We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP.
RAMP supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes).
arXiv Detail & Related papers (2022-11-28T11:24:51Z) - Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z) - Single-Shot Optical Neural Network [55.41644538483948]
'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by deep neural networks.
We present a scalable, single-shot-per-layer weight-stationary optical processor.
arXiv Detail & Related papers (2022-05-18T17:49:49Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
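A minimal sketch of the general idea is given below: quantized weights are decomposed into a weighted sum of {-1, +1} planes so that the matrix product needs only sign flips and additions per branch. The odd-level grid, bit-width, and helper names are assumptions and do not reproduce the paper's exact encoding.

```python
import numpy as np

def decompose_pm1(w_int, bits):
    """Split odd-valued integer weights into `bits` {-1, +1} planes.

    w_int must lie on the odd grid {-(2^bits - 1), ..., -1, +1, ..., 2^bits - 1};
    then w_int = sum_i 2^i * B_i with every entry of B_i in {-1, +1}.
    """
    planes = []
    residual = w_int.astype(np.int64).copy()
    for i in reversed(range(bits)):
        b = np.where(residual > 0, 1, -1)     # greedy sign choice per weight
        residual = residual - b * (2 ** i)
        planes.append((i, b))
    assert not residual.any()                 # exact on the odd grid
    return planes

def branch_matmul(planes, x):
    """y = W @ x computed as a weighted sum of sign-only (binary) branches."""
    return sum((2 ** i) * (b @ x) for i, b in planes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bits = 2
    # Quantize random weights onto the odd {-3, -1, +1, +3} grid (illustrative).
    w = rng.standard_normal((4, 6))
    levels = np.array([-3, -1, 1, 3])
    w_int = levels[np.abs(w[..., None] * 3 - levels).argmin(-1)]
    x = rng.standard_normal((6, 5))
    planes = decompose_pm1(w_int, bits)
    print(np.allclose(branch_matmul(planes, x), w_int @ x))  # True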
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z) - ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning [1.2019888796331233]
Matrix-vector multiplication (MVM) and the vector-vector outer product (VVOP) are the two most expensive operations associated with the training of deep neural networks (DNNs).
We introduce efficient stochastic computing (SC) techniques for the weight update in DNNs with the activation functions required by many state-of-the-art networks.
Our architecture reduces the computational cost by re-using random numbers and replacing certain FP multiplication operations by bit shift scaling.
Hardware design of ESSOP at the 14 nm technology node shows that, compared to a highly pipelined FP16 multiplier, ESSOP is 82.2% and 93.7% better in energy and area efficiency, respectively.
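The following is a generic sketch of a stochastic rank-1 (outer-product) weight update in which one random number is drawn per row and per column and reused across the whole update, so each weight cell needs only an AND and a sign rather than an FP multiply. The function name, pass count, and [0, 1] normalisation are assumptions, not the ESSOP design, and the bit-shift scaling aspect is not modelled.

```python
import numpy as np

def stochastic_outer_update(x, delta, lr, n_passes=64, seed=0):
    """Monte-Carlo estimate of the rank-1 update lr * outer(x, delta).

    Assumes |x|, |delta| <= 1. Each pass draws ONE random number per x entry
    and ONE per delta entry and reuses them across the whole row/column (the
    random-number reuse idea); a weight cell is incremented only when both
    stochastic bits fire, so the per-cell work is an AND plus a sign.
    """
    rng = np.random.default_rng(seed)
    xs, ds = np.abs(x), np.abs(delta)
    sign = np.sign(x)[:, None] * np.sign(delta)[None, :]
    acc = np.zeros((x.size, delta.size))
    for _ in range(n_passes):
        bx = xs > rng.random(x.size)      # stochastic bits for x (reused per row)
        bd = ds > rng.random(delta.size)  # stochastic bits for delta (reused per column)
        acc += np.outer(bx, bd)           # AND of bits == product of {0, 1}
    return lr * sign * acc / n_passes

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x, d = rng.uniform(-1, 1, 4), rng.uniform(-1, 1, 3)
    est = stochastic_outer_update(x, d, lr=0.1, n_passes=20000)
    print(np.round(est, 2))
    print(np.round(0.1 * np.outer(x, d), 2))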
arXiv Detail & Related papers (2020-03-25T07:54:42Z)