HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point
Operations for Convolutional Neural Networks
- URL: http://arxiv.org/abs/2007.06563v3
- Date: Sun, 28 Feb 2021 16:52:38 GMT
- Title: HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point
Operations for Convolutional Neural Networks
- Authors: James Garland, David Gregg
- Abstract summary: Convolutional neural networks (CNNs) are typically trained using 16- or 32-bit floating-point (FP).
Low-precision FP can be highly effective for inference.
Existing processors do not generally support custom-precision FP.
We propose hardware optimized bitslice-parallel floating-point operators (HOBFLOPS).
- Score: 0.2148535041822524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) are typically trained using 16- or
32-bit floating-point (FP) and researchers show that low-precision
floating-point (FP) can be highly effective for inference. Low-precision FP can
be implemented in field programmable gate array (FPGA) and application-specific
integrated circuit (ASIC) accelerators, but existing processors do not
generally support custom precision FP. We propose hardware optimized
bitslice-parallel floating-point operators (HOBFLOPS), a method of generating
efficient custom-precision emulated bitslice-parallel software FP arithmetic.
We generate custom-precision FP routines optimized using a hardware synthesis
design flow to create circuits. We provide standard cell libraries matching the
bitwise operations on the target microprocessor architecture, and a
code-generator to translate the hardware circuits to bitslice software
equivalents. We exploit bitslice parallelism to create a very wide (32-512
element) vectorized CNN convolution. HOBFLOPS multiply-accumulate (MAC)
performance in CNN convolution on Arm and Intel processors is compared to
Berkeley's SoftFP16 equivalent MAC. HOBFLOPS16 outperforms SoftFP16 by 8x on
Intel AVX512. HOBFLOPS offers arbitrary-precision FP with custom range and
precision; e.g., HOBFLOPS9 delivers 6x the performance of HOBFLOPS16 on Arm
Neon. HOBFLOPS allows researchers to prototype
different levels of custom FP precision in the arithmetic of software CNN
accelerators. Furthermore, HOBFLOPS fast custom-precision FP CNNs may be
valuable in cases where memory bandwidth is limited.
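
The core of HOBFLOPS is bitslicing: N operands are transposed so that bit i of every lane lives in machine word i, after which a synthesized gate netlist of ANDs, ORs, and XORs maps one-to-one onto bitwise instructions and computes all N lanes simultaneously. The C sketch below is a minimal illustration of that idea, a bitsliced ripple-carry adder over 64 lanes held in plain uint64_t words; it is not the authors' generated code, and a real HOBFLOPS operator would compose such netlists into the align/add/normalize stages of an FP multiply-accumulate.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NBITS 8  /* width of each emulated integer, e.g. a mantissa field */

/* Bitsliced ripple-carry adder: word i of a[]/b[]/sum[] holds bit i of all
 * 64 lanes, so each bitwise operation below performs 64 one-bit additions
 * at once, exactly as a synthesized full-adder netlist would in hardware. */
static void bitslice_add(const uint64_t a[NBITS], const uint64_t b[NBITS],
                         uint64_t sum[NBITS]) {
    uint64_t carry = 0;
    for (int i = 0; i < NBITS; i++) {
        uint64_t axb = a[i] ^ b[i];
        sum[i] = axb ^ carry;                    /* full-adder sum bit   */
        carry = (a[i] & b[i]) | (axb & carry);   /* full-adder carry bit */
    }
}

/* Transpose 64 scalars into bitsliced form: bit j of slice[i] = bit i of v[j]. */
static void to_slices(const uint8_t v[64], uint64_t slice[NBITS]) {
    for (int i = 0; i < NBITS; i++) {
        slice[i] = 0;
        for (int j = 0; j < 64; j++)
            slice[i] |= (uint64_t)((v[j] >> i) & 1) << j;
    }
}

static uint8_t from_slices(const uint64_t slice[NBITS], int lane) {
    uint8_t v = 0;
    for (int i = 0; i < NBITS; i++)
        v |= (uint8_t)(((slice[i] >> lane) & 1) << i);
    return v;
}

int main(void) {
    uint8_t a[64], b[64];
    for (int j = 0; j < 64; j++) { a[j] = (uint8_t)(3 * j); b[j] = (uint8_t)(5 * j + 1); }

    uint64_t sa[NBITS], sb[NBITS], ss[NBITS];
    to_slices(a, sa);
    to_slices(b, sb);
    bitslice_add(sa, sb, ss);  /* 64 independent 8-bit additions */

    for (int j = 0; j < 64; j++)
        assert(from_slices(ss, j) == (uint8_t)(a[j] + b[j]));
    puts("all 64 lanes match");
    return 0;
}

On wider SIMD units the same netlist simply runs on 128- to 512-bit registers instead of 64-bit words, which is where the 32-512 element vector widths mentioned in the abstract come from.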
Related papers
- BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices [14.536949788395837]
Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden.
We develop a BFP-based bitwidth-aware analytical modeling framework (called BitQ) for the best BFP implementation of DNN inference on embedded platforms.
arXiv Detail & Related papers (2024-09-25T17:03:49Z)
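
Block floating point is easy to state in code: a block of values shares a single exponent, and each value keeps only a short signed mantissa, shrinking both storage and multiplier width. The following C sketch is a generic illustration of BFP quantization; the block size of 8, the 4-bit mantissa, and round-to-nearest are assumptions chosen for the example, not BitQ's tuned configuration.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK     8
#define MANT_BITS 4   /* signed mantissa width: values in [-8, 7] */

typedef struct {
    int8_t mant[BLOCK];  /* per-value mantissas */
    int    exp;          /* shared block exponent */
} BfpBlock;

static void bfp_quantize(const float x[BLOCK], BfpBlock *out) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    /* Pick the shared exponent so the largest magnitude fits the mantissa. */
    int e;
    frexpf(amax, &e);                /* amax = m * 2^e with m in [0.5, 1) */
    out->exp = e - (MANT_BITS - 1);
    for (int i = 0; i < BLOCK; i++) {
        long m = lroundf(ldexpf(x[i], -out->exp));  /* round x / 2^exp */
        if (m >  7) m =  7;          /* clamp to the 4-bit signed range */
        if (m < -8) m = -8;
        out->mant[i] = (int8_t)m;
    }
}

static float bfp_dequantize(const BfpBlock *b, int i) {
    return ldexpf((float)b->mant[i], b->exp);  /* mant * 2^exp */
}

int main(void) {
    float x[BLOCK] = {0.11f, -0.52f, 0.30f, 0.97f, -0.04f, 0.66f, -0.81f, 0.25f};
    BfpBlock b;
    bfp_quantize(x, &b);
    for (int i = 0; i < BLOCK; i++)
        printf("%+.3f -> %+.3f\n", x[i], bfp_dequantize(&b, i));
    return 0;
}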
- Fast Algorithms for Spiking Neural Network Simulation with FPGAs [0.0]
We create spiking neural network (SNN) simulators for the Potjans-Diesmann cortical microcircuit on a high-end field-programmable gate array (FPGA).
Our best simulators run the circuit 25% faster than real time, require less than 21 nJ per synaptic event, and are bottlenecked by the device's on-chip memory.
This is the first result for simulating the circuit on a single hardware accelerator.
arXiv Detail & Related papers (2024-05-03T11:39:25Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z)
- HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices [71.45672882756001]
This study introduces a novel streaming architecture based toolflow for mapping 3D Convolutional Neural Networks onto FPGAs.
The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics.
The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs.
arXiv Detail & Related papers (2023-03-30T08:25:27Z)
- Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge in interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
arXiv Detail & Related papers (2022-09-22T18:57:49Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
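
Butterfly sparsity factors a dense NxN weight matrix into log2(N) sparse stages, each mixing pairs of elements at a fixed stride through its own 2x2 block, so the cost falls from O(N^2) to O(N log N) multiply-accumulates. The C sketch below shows the access pattern only; in the paper's accelerator the 2x2 blocks are learned, whereas here they are all set to [[1,1],[1,-1]], which turns the network into an unnormalized Hadamard transform and gives a convenient correctness check.

#include <stdio.h>

#define N      8  /* transform size; must be a power of two */
#define STAGES 3  /* log2(N) */

/* One butterfly linear transform: STAGES passes, each mixing element pairs
 * (i, i+stride) through a per-pair 2x2 block m = {m00, m01, m10, m11}.
 * A dense NxN multiply costs O(N^2) MACs; this factorization O(N log N). */
static void butterfly_apply(float w[STAGES][N / 2][4], float x[N]) {
    int stage = 0;
    for (int stride = N / 2; stride >= 1; stride /= 2, stage++) {
        int pair = 0;
        for (int base = 0; base < N; base += 2 * stride) {
            for (int i = base; i < base + stride; i++, pair++) {
                const float *m = w[stage][pair];
                float a = x[i], b = x[i + stride];
                x[i]          = m[0] * a + m[1] * b;
                x[i + stride] = m[2] * a + m[3] * b;
            }
        }
    }
}

int main(void) {
    /* All blocks set to [[1, 1], [1, -1]]: the butterfly then computes an
     * unnormalized Hadamard transform, so an impulse input maps to all ones. */
    float w[STAGES][N / 2][4];
    for (int s = 0; s < STAGES; s++)
        for (int p = 0; p < N / 2; p++) {
            w[s][p][0] = 1; w[s][p][1] = 1;
            w[s][p][2] = 1; w[s][p][3] = -1;
        }
    float x[N] = {1, 0, 0, 0, 0, 0, 0, 0};
    butterfly_apply(w, x);
    for (int i = 0; i < N; i++)
        printf("%g ", x[i]);
    printf("\n");
    return 0;
}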
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete settings (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
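
The encoding rests on a simple bit-level identity: each bit b of an n-bit value can be written as b = (c + 1)/2 with c in {-1, +1}, so the value becomes a scaled sum of {-1, +1} planes that cheap binary kernels can process. The C sketch below verifies that identity exhaustively for 4-bit values; it illustrates the general encoding idea rather than the paper's full multi-branch training scheme.

#include <assert.h>
#include <stdio.h>

#define NBITS 4  /* quantization bitwidth */

/* Decompose an unsigned NBITS-bit value x into c[i] in {-1, +1} via
 * b = (c + 1) / 2 per bit, so that
 *   x = ( sum_i 2^i * c[i] + (2^NBITS - 1) ) / 2.
 * Each c-plane can then be handled by a binary (+/-1) branch. */
static void decompose(unsigned x, int c[NBITS]) {
    for (int i = 0; i < NBITS; i++)
        c[i] = ((x >> i) & 1u) ? +1 : -1;
}

static unsigned reconstruct(const int c[NBITS]) {
    int s = 0;
    for (int i = 0; i < NBITS; i++)
        s += (1 << i) * c[i];
    return (unsigned)((s + (1 << NBITS) - 1) / 2);
}

int main(void) {
    int c[NBITS];
    for (unsigned x = 0; x < (1u << NBITS); x++) {
        decompose(x, c);
        assert(reconstruct(c) == x);  /* identity holds for every value */
    }
    puts("decomposition verified for all 4-bit values");
    return 0;
}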
- A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
arXiv Detail & Related papers (2020-12-17T15:29:41Z)
- A Learning Framework for n-bit Quantized Neural Networks toward FPGAs [20.83904734716565]
This paper proposes a novel learning framework for n-bit QNNs whose weights are constrained to powers of two.
We also propose a novel QNN structure named n-BQ-NN, which uses shift operations to replace multiply operations.
Experiments show that our n-BQ-NN with our SVPE can execute 2.9 times faster than with the vector processing element (VPE) in inference.
arXiv Detail & Related papers (2020-04-06T04:21:24Z)
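
Power-of-two weights are what make the shift substitution work: every weight has the form sign * 2^k, so each multiply-accumulate collapses to a shift and an add and needs no hardware multiplier. The C sketch below shows that reduction for a small dot product; the struct layout and operand widths are illustrative assumptions, not the SVPE design from the paper.

#include <stdint.h>
#include <stdio.h>

/* With weights constrained to powers of two, w = sign * 2^k, every
 * multiply-accumulate reduces to a shift and an add. */
typedef struct {
    int8_t sign;  /* +1 or -1 */
    int8_t k;     /* weight magnitude is 2^k */
} Pow2Weight;

static int32_t dot_pow2(const Pow2Weight *w, const int16_t *x, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        int32_t shifted = (int32_t)x[i] << w[i].k;  /* x * 2^k via shift */
        acc += (w[i].sign > 0) ? shifted : -shifted;
    }
    return acc;
}

int main(void) {
    Pow2Weight w[4] = {{+1, 0}, {-1, 1}, {+1, 3}, {-1, 2}};  /* +1, -2, +8, -4 */
    int16_t x[4] = {5, 3, 2, 7};
    /* Expected: 5*1 - 3*2 + 2*8 - 7*4 = -13 */
    printf("dot = %d\n", dot_pow2(w, x, 4));
    return 0;
}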
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.