Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark
- URL: http://arxiv.org/abs/2206.11791v1
- Date: Thu, 23 Jun 2022 15:57:17 GMT
- Title: Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark
- Authors: Hendrik Borras and Giuseppe Di Guglielmo and Javier Duarte and
Nicolò Ghielmetti and Ben Hawks and Scott Hauck and Shih-Chieh Hsu and Ryan
Kastner and Jason Liang and Andres Meza and Jules Muhizi and Tai Nguyen and
Rushil Roy and Nhan Tran and Yaman Umuroglu and Olivia Weng and Aidan Yokuda
and Michaela Blott
- Abstract summary: We present our development experience for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms.
We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs.
The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms.
- Score: 11.575901540758574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present our development experience and recent results for the MLPerf Tiny
Inference Benchmark on field-programmable gate array (FPGA) platforms. We use
the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware
codesign of optimized neural networks on FPGAs. We present the design and
implementation process for the keyword spotting, anomaly detection, and image
classification benchmark tasks. The resulting hardware implementations are
quantized, configurable, spatial dataflow architectures tailored for speed and
efficiency and introduce new generic optimizations and common workflows
developed as a part of this work. The full workflow is presented from
quantization-aware training to FPGA implementation. The solutions are deployed
on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The
resulting submissions achieve latencies as low as 20 $\mu$s and energy
consumption as low as 30 $\mu$J per inference. We demonstrate how emerging ML
benchmarks on heterogeneous hardware platforms can catalyze collaboration and
the development of new techniques and more accessible tools.
Related papers
- Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms [77.71341200638416]
ChiPBench is a benchmark designed to evaluate the effectiveness of AI-based chip placement algorithms.
We have gathered 20 circuits from various domains (e.g., CPU, GPU, and microcontrollers) for evaluation.
Results show that even when a single-point algorithm's intermediate metric is dominant, the final PPA results are unsatisfactory.
arXiv Detail & Related papers (2024-07-03T03:29:23Z) - Investigating Resource-efficient Neutron/Gamma Classification ML Models Targeting eFPGAs [0.0]
Open-source embedded FPGA (eFPGA) frameworks provide an alternate, more flexible pathway for implementing machine learning models in hardware.
We explore the parameter space for eFPGA implementations of fully-connected neural network (fcNN) and boosted decision tree (BDT) models.
The results of the study will be used to aid the specification of an eFPGA fabric, which will be integrated as part of a test chip.
arXiv Detail & Related papers (2024-04-19T20:03:30Z) - Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z) - SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on
FPGA Devices [48.47320494918925]
This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low-latency applications.
We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion.
We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources.
arXiv Detail & Related papers (2023-09-04T13:15:01Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - End-to-end codesign of Hessian-aware quantized neural networks for FPGAs
and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to non-experts, in a single open-sourced workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z) - HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on
FPGA Devices [71.45672882756001]
This study introduces a novel streaming architecture based toolflow for mapping 3D Convolutional Neural Networks onto FPGAs.
The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics.
The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs.
arXiv Detail & Related papers (2023-03-30T08:25:27Z) - VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit
Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs).
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z) - HALF: Holistic Auto Machine Learning for FPGAs [1.9146960682777232]
Deep Neural Networks (DNNs) are capable of solving complex problems in domains related to embedded systems, such as image and natural language processing.
To efficiently implement DNNs on a specific FPGA platform for a given cost criterion, e.g. energy efficiency, an enormous number of design parameters must be considered.
An automatic, holistic design approach can improve the quality of DNN implementations on FPGA significantly.
arXiv Detail & Related papers (2021-06-28T14:45:47Z) - Accelerated Charged Particle Tracking with Graph Neural Networks on
FPGAs [0.0]
We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks.
We find a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing.
arXiv Detail & Related papers (2020-11-30T18:17:43Z) - AutoML for Multilayer Perceptron and FPGA Co-design [0.0]
State-of-the-art Neural Network Architectures (NNAs) are challenging to design and implement efficiently in hardware.
Much of the recent research in the auto-design of NNAs has focused on convolution networks and image recognition.
We develop and test a general multilayer perceptron (MLP) flow that can take arbitrary datasets as input and automatically produce optimized NNAs and hardware designs.
arXiv Detail & Related papers (2020-09-14T02:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.