Accelerating Markov Random Field Inference with Uncertainty Quantification
- URL: http://arxiv.org/abs/2108.00570v1
- Date: Mon, 2 Aug 2021 00:02:53 GMT
- Title: Accelerating Markov Random Field Inference with Uncertainty Quantification
- Authors: Ramin Bashizade, Xiangyu Zhang, Sayan Mukherjee, Alvin R. Lebeck
- Abstract summary: Probabilistic algorithms are computationally expensive on conventional processors, but their statistical properties, namely interpretability and uncertainty quantification (UQ), make them an attractive alternative approach.
We propose a high-throughput accelerator for Markov Random Field (MRF) inference.
We also propose a novel hybrid on-chip/off-chip memory system and logging scheme to efficiently support UQ.
- Score: 10.825800519362579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical machine learning has widespread application in various domains.
These methods include probabilistic algorithms, such as Markov Chain
Monte-Carlo (MCMC), which rely on generating random numbers from probability
distributions. These algorithms are computationally expensive on conventional
processors, yet their statistical properties, namely interpretability and
uncertainty quantification (UQ), make them an attractive alternative to deep
learning. Therefore, hardware specialization can be
adopted to address the shortcomings of conventional processors in running these
applications.
In this paper, we propose a high-throughput accelerator for Markov Random
Field (MRF) inference, a powerful model for representing a wide range of
applications, using MCMC with Gibbs sampling. We propose a tiled architecture
which takes advantage of near-memory computing, and memory optimizations
tailored to the semantics of MRF. Additionally, we propose a novel hybrid
on-chip/off-chip memory system and logging scheme to efficiently support UQ.
This memory system design is not specific to MRF models and is applicable to
applications using probabilistic algorithms. In addition, it dramatically
reduces off-chip memory bandwidth requirements.
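The Gibbs-sampling procedure that the accelerator parallelizes can be sketched in software. The toy implementation below (all potentials, parameter values, and names are illustrative assumptions, not the paper's design) runs Gibbs sampling over a binary grid MRF for denoising and derives per-pixel uncertainty from the logged samples, the kind of statistic the proposed logging scheme supports in hardware:

```python
# Illustrative sketch only: Gibbs sampling on a binary (Ising-style) grid MRF
# with per-pixel uncertainty quantification from collected samples.
import numpy as np

def gibbs_mrf(noisy, n_sweeps=30, burn_in=5, beta=1.0, coupling=1.5, seed=0):
    """Gibbs sampling over a grid MRF with labels in {-1, +1}.

    noisy: observed image with values in {-1, +1}
    Returns (marginal_prob_plus, per_pixel_variance) as UQ estimates.
    """
    rng = np.random.default_rng(seed)
    h, w = noisy.shape
    x = noisy.copy()
    counts = np.zeros((h, w))  # how often each pixel was sampled as +1
    kept = 0
    for sweep in range(n_sweeps):
        for i in range(h):
            for j in range(w):
                # Sum of neighboring labels (pairwise smoothness potential)
                nb = 0.0
                if i > 0:     nb += x[i - 1, j]
                if i < h - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < w - 1: nb += x[i, j + 1]
                # Conditional P(x_ij = +1 | neighbors, observation)
                field = coupling * nb + beta * noisy[i, j]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
                x[i, j] = 1 if rng.random() < p_plus else -1
        if sweep >= burn_in:            # log samples after burn-in
            counts += (x == 1)
            kept += 1
    p = counts / kept                   # marginal estimate P(x_ij = +1)
    var = p * (1.0 - p)                 # Bernoulli variance as a UQ measure
    return p, var
```

The key observation is that UQ comes essentially for free from the sample log: the same per-pixel counters that yield the marginal estimate also yield its variance, which is why an efficient logging scheme, rather than extra compute, is the enabler for UQ.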
We implemented an FPGA prototype of our proposed architecture using
high-level synthesis tools and achieved a 146 MHz frequency for an accelerator
with 32 function units on an Intel Arria 10 FPGA. Compared to prior work on
FPGA, our accelerator achieves 26X speedup. Furthermore, our proposed memory
system and logging scheme to support UQ reduces off-chip bandwidth by 71% for
two applications. ASIC analysis in 15 nm shows our design with 2048 function
units running at 3 GHz outperforms GPU implementations of motion estimation and
stereo vision on Nvidia RTX2080Ti by 120X-210X, occupying only 7.7% of the
area.
Related papers
- Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training [48.13509528824236]
Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings.
ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties for, and can even be infeasible on, hardware platforms such as FPGAs and ASICs.
We propose PeZO, a perturbation-efficient ZO framework that significantly reduces the demand for random number generation.
Our experiments show that PeZO reduces the LUTs and FFs required for random number generation by 48.6% and 12.7%, and saves up to 86% of power consumption.
arXiv Detail & Related papers (2025-04-28T23:58:07Z)
- Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs [0.2294388534633318]
eFPGAs allow for the design of hardware accelerators of edge Machine Learning (ML) applications at a lower power budget.
The limited eFPGA logic and memory significantly constrain compute capabilities and model size.
The proposed eFPGA accelerator focuses on minimizing resource usage and prioritizes flexibility for on-field recalibration over raw throughput.
arXiv Detail & Related papers (2025-02-10T12:49:22Z)
- Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
- Pruning random resistive memory for optimizing analogue AI [54.21621702814583]
AI models present unprecedented challenges to energy consumption and environmental sustainability.
One promising solution is to revisit analogue computing, a technique that predates digital computing.
Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning.
arXiv Detail & Related papers (2023-11-13T08:59:01Z)
- FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition on The Edge [0.6254873489691849]
This paper proposed an integrated end-to-end HAR scalable HW/SW accelerator co-design based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN architecture.
Our development uses a partially streaming dataflow architecture to achieve higher throughput at a favorable trade-off between network design and resource utilization.
Our proposed methodology achieved nearly 81% prediction accuracy with approximately 24 FPS real-time inference throughput at 187 MHz on the ZCU104.
arXiv Detail & Related papers (2023-11-04T10:38:21Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Design optimization for high-performance computing using FPGA [0.0]
We optimize Tensil AI's open-source inference accelerator for maximum performance using ResNet20 trained on CIFAR.
Running the CIFAR test data set shows very little accuracy drop when reducing precision from the original 32-bit floating point.
The proposed accelerator achieves a throughput of 21.12 Giga-Operations Per Second (GOP/s) with a 5.21 W on-chip power consumption at 100 MHz.
arXiv Detail & Related papers (2023-04-24T22:20:42Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Hardware architecture for high throughput event visual data filtering with matrix of IIR filters algorithm [0.0]
Neuromorphic vision is a rapidly growing field with numerous applications in the perception systems of autonomous vehicles.
There is a significant amount of noise in the event stream due to the sensor's working principle.
We present a novel algorithm based on an IIR filter matrix for filtering this type of noise and a hardware architecture that allows its acceleration.
arXiv Detail & Related papers (2022-07-02T15:18:53Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
arXiv Detail & Related papers (2020-12-17T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.