Accelerating Markov Random Field Inference with Uncertainty Quantification
- URL: http://arxiv.org/abs/2108.00570v1
- Date: Mon, 2 Aug 2021 00:02:53 GMT
- Title: Accelerating Markov Random Field Inference with Uncertainty Quantification
- Authors: Ramin Bashizade, Xiangyu Zhang, Sayan Mukherjee, Alvin R. Lebeck
- Abstract summary: Probabilistic algorithms are computationally expensive on conventional processors, but their statistical properties, namely interpretability and uncertainty quantification (UQ), make them an attractive alternative approach.
We propose a high-throughput accelerator for Markov Random Field (MRF) inference.
We also propose a novel hybrid on-chip/off-chip memory system and logging scheme to efficiently support UQ.
- Score: 10.825800519362579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical machine learning has widespread application in various domains.
These methods include probabilistic algorithms, such as Markov Chain
Monte-Carlo (MCMC), which rely on generating random numbers from probability
distributions. These algorithms are computationally expensive on conventional
processors, yet their statistical properties, namely interpretability and
uncertainty quantification (UQ), make them an attractive alternative to deep
learning. Therefore, hardware specialization can be
adopted to address the shortcomings of conventional processors in running these
applications.
In this paper, we propose a high-throughput accelerator for Markov Random
Field (MRF) inference, a powerful model for representing a wide range of
applications, using MCMC with Gibbs sampling. We propose a tiled architecture
which takes advantage of near-memory computing, and memory optimizations
tailored to the semantics of MRF. Additionally, we propose a novel hybrid
on-chip/off-chip memory system and logging scheme to efficiently support UQ.
This memory system design is not specific to MRF models and is applicable to
applications using probabilistic algorithms. In addition, it dramatically
reduces off-chip memory bandwidth requirements.
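The Gibbs-sampling procedure that the accelerator parallelizes can be sketched in software. The toy implementation below (all potentials, parameter values, and names are illustrative assumptions, not the paper's design) runs Gibbs sampling over a binary grid MRF for denoising and derives per-pixel uncertainty from the logged samples, the kind of statistic the proposed logging scheme supports in hardware:

```python
# Illustrative sketch only: Gibbs sampling on a binary (Ising-style) grid MRF
# with per-pixel uncertainty quantification from collected samples.
import numpy as np

def gibbs_mrf(noisy, n_sweeps=30, burn_in=5, beta=1.0, coupling=1.5, seed=0):
    """Gibbs sampling over a grid MRF with labels in {-1, +1}.

    noisy: observed image with values in {-1, +1}
    Returns (marginal_prob_plus, per_pixel_variance) as UQ estimates.
    """
    rng = np.random.default_rng(seed)
    h, w = noisy.shape
    x = noisy.copy()
    counts = np.zeros((h, w))  # how often each pixel was sampled as +1
    kept = 0
    for sweep in range(n_sweeps):
        for i in range(h):
            for j in range(w):
                # Sum of neighboring labels (pairwise smoothness potential)
                nb = 0.0
                if i > 0:     nb += x[i - 1, j]
                if i < h - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < w - 1: nb += x[i, j + 1]
                # Conditional P(x_ij = +1 | neighbors, observation)
                field = coupling * nb + beta * noisy[i, j]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
                x[i, j] = 1 if rng.random() < p_plus else -1
        if sweep >= burn_in:            # log samples after burn-in
            counts += (x == 1)
            kept += 1
    p = counts / kept                   # marginal estimate P(x_ij = +1)
    var = p * (1.0 - p)                 # Bernoulli variance as a UQ measure
    return p, var
```

The key observation is that UQ comes essentially for free from the sample log: the same per-pixel counters that yield the marginal estimate also yield its variance, which is why an efficient logging scheme, rather than extra compute, is the enabler for UQ.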
We implemented an FPGA prototype of our proposed architecture using
high-level synthesis tools and achieved a 146 MHz frequency for an accelerator
with 32 function units on an Intel Arria 10 FPGA. Compared to prior work on
FPGA, our accelerator achieves 26X speedup. Furthermore, our proposed memory
system and logging scheme to support UQ reduces off-chip bandwidth by 71% for
two applications. ASIC analysis in 15 nm shows our design with 2048 function
units running at 3 GHz outperforms GPU implementations of motion estimation and
stereo vision on Nvidia RTX2080Ti by 120X-210X, occupying only 7.7% of the
area.
Related papers
- Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training [48.13509528824236]
Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings.
ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties for, and can even be infeasible on, hardware platforms such as FPGAs and ASICs.
We propose PeZO, a perturbation-efficient ZO framework that significantly reduces the demand for random number generation.
Our experiments show that PeZO reduces the LUTs and FFs required for random number generation by 48.6% and 12.7%, and saves up to 86% of power consumption.
arXiv Detail & Related papers (2025-04-28T23:58:07Z)
- Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs [0.2294388534633318]
eFPGAs allow for the design of hardware accelerators of edge Machine Learning (ML) applications at a lower power budget.
The limited eFPGA logic and memory significantly constrain compute capabilities and model size.
The proposed eFPGA accelerator focuses on minimizing resource usage and prioritizes flexibility for on-field recalibration over raw throughput.
arXiv Detail & Related papers (2025-02-10T12:49:22Z)
- Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
- Pruning random resistive memory for optimizing analogue AI [54.21621702814583]
AI models present unprecedented challenges to energy consumption and environmental sustainability.
One promising solution is to revisit analogue computing, a technique that predates digital computing.
Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning.
arXiv Detail & Related papers (2023-11-13T08:59:01Z)
- FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition on The Edge [0.6254873489691849]
This paper proposed an integrated end-to-end HAR scalable HW/SW accelerator co-design based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN architecture.
Our development uses a partially streaming dataflow architecture to achieve higher throughput at a favorable trade-off between network design and resource utilization.
Our proposed methodology achieved nearly 81% prediction accuracy with approximately 24 FPS real-time inference throughput at 187 MHz on the ZCU104.
arXiv Detail & Related papers (2023-11-04T10:38:21Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Design optimization for high-performance computing using FPGA [0.0]
We optimize Tensil AI's open-source inference accelerator for maximum performance using ResNet20 trained on CIFAR.
Running the CIFAR test data set shows very little accuracy drop when reducing precision from the original 32-bit floating point.
The proposed accelerator achieves a throughput of 21.12 Giga-Operations Per Second (GOP/s) with a 5.21 W on-chip power consumption at 100 MHz.
arXiv Detail & Related papers (2023-04-24T22:20:42Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Hardware architecture for high throughput event visual data filtering with matrix of IIR filters algorithm [0.0]
Neuromorphic vision is a rapidly growing field with numerous applications in the perception systems of autonomous vehicles.
There is a significant amount of noise in the event stream due to the sensor's working principle.
We present a novel algorithm based on an IIR filter matrix for filtering this type of noise and a hardware architecture that allows its acceleration.
arXiv Detail & Related papers (2022-07-02T15:18:53Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
arXiv Detail & Related papers (2020-12-17T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.