Exploring Parallelism in FPGA-Based Accelerators for Machine Learning Applications
- URL: http://arxiv.org/abs/2511.11640v1
- Date: Sun, 09 Nov 2025 05:05:05 GMT
- Title: Exploring Parallelism in FPGA-Based Accelerators for Machine Learning Applications
- Authors: Sed Centeno, Christopher Sprague, Arnab A Purkayastha, Ray Simar, Neeraj Magotra
- Abstract summary: Speculative backpropagation has emerged as a promising technique to accelerate the training of neural networks by overlapping the forward and backward passes. We implement speculative backpropagation on the MNIST dataset using OpenMP as the parallel programming platform.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative backpropagation has emerged as a promising technique to accelerate the training of neural networks by overlapping the forward and backward passes. Leveraging speculative weight updates when error gradients fall within a specific threshold reduces training time without substantially compromising accuracy. In this work, we implement speculative backpropagation on the MNIST dataset using OpenMP as the parallel programming platform. OpenMP's multi-threading capabilities enable simultaneous execution of forward and speculative backpropagation steps, significantly improving training speed. The application is planned for synthesis on a state-of-the-art FPGA to demonstrate its potential for hardware acceleration. Our CPU-based experimental results demonstrate that speculative backpropagation achieves a maximum speedup of 24% in execution time when using a threshold of 0.25, with accuracy remaining within 3-4% of the baseline across various epochs. Additionally, when comparing individual step execution times, speculative backpropagation yields a maximum speedup of 35% over the baseline, demonstrating the effectiveness of overlapping forward and backward passes.
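The core idea in the abstract can be sketched in a few lines: while the forward pass of the current batch runs on one thread, the previous batch's gradient is speculatively applied on another, but only when that gradient is small enough to be safe. The tiny model, batch shapes, and helper names below are hypothetical, and Python threads stand in for the paper's OpenMP threads; this is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
THRESHOLD = 0.25  # the threshold that gave the paper's best 24% speedup
LR = 0.1
W = rng.normal(scale=0.1, size=(4, 1))  # toy one-layer logistic model

def forward(x):
    return 1.0 / (1.0 + np.exp(-x @ W))  # sigmoid output

def grad(x, y, pred):
    return x.T @ (pred - y) / len(x)  # BCE-with-sigmoid gradient

def make_batch():
    x = rng.normal(size=(8, 4))
    y = (x.sum(axis=1, keepdims=True) > 0).astype(float)
    return x, y

speculative_updates = 0
prev_g = None
with ThreadPoolExecutor(max_workers=1) as pool:
    x, y = make_batch()
    for _ in range(50):
        # Launch the forward pass in a worker thread (OpenMP plays this
        # role in the paper) ...
        fut = pool.submit(forward, x)
        # ... while this thread speculatively applies the *previous*
        # gradient, but only if its magnitude fell within the threshold.
        # Note the deliberate race on W: that overlap is the speculation.
        if prev_g is not None and np.abs(prev_g).max() < THRESHOLD:
            W -= LR * prev_g
            speculative_updates += 1
        pred = fut.result()          # join: forward pass finished
        prev_g = grad(x, y, pred)    # backward pass for this batch
        x, y = make_batch()
```

In a real implementation the speculative update would land while the forward pass is still in flight, which is where the reported 24-35% speedups come from; here the overlap window is too short to measure, so the sketch only exercises the thresholding logic.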
Related papers
- SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting [12.317709090608837]
We present SpecEE, a fast inference engine with speculative early exiting. SpecEE achieves 2.25x and 2.43x speedup with Llama2-7B in cloud and PC scenarios, respectively.
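The general early-exit pattern behind this line of work can be illustrated compactly: a small predictor head after each layer decides whether the output is already confident enough to skip the remaining layers. Everything below (the toy layer stack, the shared exit head, the 0.9 confidence threshold) is a hypothetical stand-in, not SpecEE's architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def infer_with_early_exit(x, layers, exit_head, threshold=0.9):
    """Run layers in order; after each one, consult a lightweight
    predictor head and stop early once it is confident enough."""
    h = x
    for i, layer in enumerate(layers):
        h = np.tanh(layer @ h)
        probs = softmax(exit_head @ h)
        if probs.max() >= threshold:
            return probs.argmax(), i + 1  # exited after i+1 layers
    return probs.argmax(), len(layers)    # no exit fired: full depth

rng = np.random.default_rng(1)
layers = [rng.normal(size=(8, 8)) for _ in range(6)]
exit_head = rng.normal(size=(4, 8))
label, depth_used = infer_with_early_exit(rng.normal(size=8), layers, exit_head)
```

The speedup comes from `depth_used` being smaller than the full depth on easy inputs; the cost is the extra exit-head evaluation at every candidate exit.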
arXiv Detail & Related papers (2025-04-11T02:38:53Z) - Fast Training of Recurrent Neural Networks with Stationary State Feedbacks [48.22082789438538]
Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers. We propose a novel method that replaces backpropagation through time (BPTT) with a fixed gradient feedback mechanism.
arXiv Detail & Related papers (2025-03-29T14:45:52Z) - Speedy MASt3R [68.47052557089631]
MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme. Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
arXiv Detail & Related papers (2025-03-13T03:56:22Z) - Optimized Speculative Sampling for GPU Hardware Accelerators [14.681982904792763]
We optimize speculative sampling for parallel hardware accelerators to improve sampling speed.
We distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks.
We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate our methods.
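The acceptance rule that speculative sampling parallelises is standard and worth spelling out: a cheap draft model proposes a token, and the target model accepts it with probability min(1, p/q), otherwise resampling from the residual distribution max(0, p - q). The distributions below are made up for illustration, and the GPU thread-block parallelisation described in the paper is not reproduced here.

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """Accept the draft model's token with probability min(1, p/q);
    on rejection, resample from the normalised residual max(0, p - q).
    This keeps the overall output distribution exactly equal to p."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model's next-token distribution
q = np.array([0.3, 0.5, 0.2])   # cheaper draft model's distribution
draft = rng.choice(3, p=q)      # token proposed by the draft model
token = speculative_accept(p, q, draft, rng)
```

Because the accept/reject step is exact, all of the engineering effort (and the paper's contribution) goes into evaluating p and q for many draft tokens at once across GPU threads.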
arXiv Detail & Related papers (2024-06-16T17:19:23Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference by designing a seamless collaboration among edge devices (EDs) in Edge Computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs [15.465530153038927]
Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies.
However, training these models, which involve large numbers of parameters, is both time-consuming and energy-intensive.
arXiv Detail & Related papers (2021-09-16T04:12:51Z) - BoA-PTA, A Bayesian Optimization Accelerated Error-Free SPICE Solver [2.16151779631292]
Pseudo-transient analysis (PTA) has been shown to be one of the most promising continuation methods for SPICE solvers.
We propose BoA-PTA, a Bayesian optimization accelerated PTA that can substantially accelerate simulations and improve convergence performance without introducing extra errors.
We assess BoA-PTA in 43 benchmark circuits against other SOTA SPICE solvers and demonstrate an average 2.3x (maximum 3.5x) speed-up over the original CEPTA.
arXiv Detail & Related papers (2021-07-31T14:58:22Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism [0.0]
FIXAR employs fixed-point data types and arithmetic units for the first time using a SW/HW co-design approach.
Quantization-Aware Training (QAT) reduces its data precision based on the range of activations and performs retraining to minimize the reward degradation.
FIXAR was implemented on a Xilinx U50, achieving 25,293.3 inferences per second (IPS) of training throughput and 2,638.0 IPS/W of accelerator efficiency.
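The quantization-aware training step described above can be sketched with the common "fake quantization" trick: weights are rounded to a fixed-point grid in the forward pass while full-precision copies are kept for the update. The symmetric range-based scale below is a common QAT choice assumed for illustration; FIXAR's actual fixed-point formats and hardware mapping are not reproduced here.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Symmetric uniform quantisation of w to num_bits, with the scale
    derived from the observed weight range (an assumed, common choice)."""
    scale = np.abs(w).max() / (2 ** (num_bits - 1) - 1)
    if scale == 0:
        return w
    # Round onto the fixed-point grid, then map back to real values so
    # the rest of the network sees quantised weights during training.
    return np.round(w / scale) * scale

w = np.linspace(-1.0, 1.0, 7)       # toy full-precision weights
wq = fake_quantize(w, num_bits=4)   # 4-bit fixed-point version
```

Training against `wq` while updating `w` lets the retraining phase mentioned in the summary recover most of the accuracy lost to the reduced precision.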
arXiv Detail & Related papers (2021-02-24T07:22:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.