Related papers: Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI

Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI

URL: http://arxiv.org/abs/2511.11614v1
Date: Tue, 04 Nov 2025 03:41:42 GMT
Title: Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI
Authors: Arturo Urías Jiménez,
Abstract summary: Field-Programmable Gate Arrays (FPGAs) are a reconfigurable platform that allows mapping AI algorithms directly into device logic.<n>Unlike CPU and GPU architecture, an FPGA can be reconfigured in the field to adapt its physical structure to a specific model.<n> Partial reconfiguration and compilation flows from AI frameworks are shortening the path from prototype to deployment.
Score: 0.0
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: AI acceleration has been dominated by GPUs, but the growing need for lower latency, energy efficiency, and fine-grained hardware control exposes the limits of fixed architectures. In this context, Field-Programmable Gate Arrays (FPGAs) emerge as a reconfigurable platform that allows mapping AI algorithms directly into device logic. Their ability to implement parallel pipelines for convolutions, attention mechanisms, and post-processing with deterministic timing and reduced power consumption makes them a strategic option for workloads that demand predictable performance and deep customization. Unlike CPUs and GPUs, whose architecture is immutable, an FPGA can be reconfigured in the field to adapt its physical structure to a specific model, integrate as a SoC with embedded processors, and run inference near the sensor without sending raw data to the cloud. This reduces latency and required bandwidth, improves privacy, and frees GPUs from specialized tasks in data centers. Partial reconfiguration and compilation flows from AI frameworks are shortening the path from prototype to deployment, enabling hardware--algorithm co-design.

Related papers

An LLVM-Based Optimization Pipeline for SPDZ [0.0]
We implement a proof-of-concept LLVM-based optimization pipeline for the SPDZ protocol.<n>Our front end accepts a subset of C with lightweight privacy annotations and lowers it to LLVM IR.<n>Our back end performs data-flow and control-flow analysis on the optimized IR to drive a non-blocking runtime scheduler.
arXiv Detail & Related papers (2025-12-11T20:53:35Z)
GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow [30.600571989643626]
Animation-based Generative Codec (AGC) is emerging paradigm for talking-face video compression.<n> deploying its intricate decoder on resource and power-constrained edge devices presents challenges.<n>This paper proposes a novel field programmable gate arrays (FPGAs)-oriented AGC deployment scheme for edge-computing video services.
arXiv Detail & Related papers (2025-11-12T12:31:40Z)
Implémentation Efficiente de Fonctions de Convolution sur FPGA à l'Aide de Blocs Paramétrables et d'Approximations Polynomiales [0.3966519779235704]
Implementing convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs) has emerged as a promising alternative to GPUs.<n>This paper proposes a library of convolution Blocks designed to optimize FPGA implementation and adapt to available resources.<n>It also presents a methodological framework for developing mathematical models that predict FPGA resources utilization.
arXiv Detail & Related papers (2025-10-03T15:58:20Z)
FPGA-based Acceleration for Convolutional Neural Networks: A Comprehensive Review [3.7810245817090906]
Convolutional Neural Networks (CNNs) are fundamental to deep learning, driving applications across various domains.<n>This paper provides a comprehensive review of FPGA-based hardware accelerators specifically designed for CNNs.
arXiv Detail & Related papers (2025-05-04T04:03:37Z)
Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs [0.2294388534633318]
eFPGAs allow for the design of hardware accelerators of edge Machine Learning (ML) applications at a lower power budget.<n>The limited eFPGA logic and memory significantly constrain compute capabilities and model size.<n>The proposed eFPGA accelerator focuses on minimizing resource usage and allowing flexibility for on-field recalibration over throughput.
arXiv Detail & Related papers (2025-02-10T12:49:22Z)
DynaSplit: A Hardware-Software Co-Design Framework for Energy-Aware Inference on Edge [40.96858640950632]
We propose DynaSplit, a framework that dynamically configures parameters across both software and hardware. We evaluate DynaSplit using popular pre-trained NNs on a real-world testbed. Results show a reduction in energy consumption up to 72% compared to cloud-only computation.
arXiv Detail & Related papers (2024-10-31T12:44:07Z)
HAPM -- Hardware Aware Pruning Method for CNN hardware accelerators in resource constrained devices [44.99833362998488]
The present work proposes a generic hardware architecture ready to be implemented on FPGA devices. The inference speed of the design is evaluated over different resource constrained FPGA devices. We demonstrate that our hardware-aware pruning algorithm achieves a remarkable improvement of a 45 % in inference time compared to a network pruned using the standard algorithm.
arXiv Detail & Related papers (2024-08-26T07:27:12Z)
Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA [20.629635991749808]
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs. At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads. At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices [48.47320494918925]
This work tackles the challenges of deploying stateof-the-art object detection models onto FPGA devices for ultralow latency applications. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources.
arXiv Detail & Related papers (2023-09-04T13:15:01Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Graph Neural Network (LL-GNN) designs for particle detectors. The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z)
Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks. The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources. This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration. We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. Our design has very small accuracy loss and has 80.2 $times$ and 2.6 $times$ speedup compared to CPU and GPU implementation.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.