PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning
- URL: http://arxiv.org/abs/2501.08043v1
- Date: Tue, 14 Jan 2025 11:51:57 GMT
- Title: PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning
- Authors: Marta Andronic, Jiawen Li, George A. Constantinides
- Abstract summary: We propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block.
Our method takes advantage of the flexibility offered by soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead.
We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the Large Hadron Collider, and MNIST.
- Score: 8.791770352147989
- Abstract: Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded these operations inside FPGA lookup tables (LUTs). However, FPGA LUTs can implement a much greater variety of functions. In this paper, we propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. By using polynomial building blocks, we achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. LUT-based implementations also face a significant challenge: the LUT size grows exponentially with the number of inputs. Prior work relies on a priori fixed sparsity, with results heavily dependent on seed selection. To address this, we propose a structured pruning strategy using a bespoke hardware-aware group regularizer that encourages a particular sparsity pattern that leads to a small number of inputs per neuron. We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and MNIST.
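The two ideas in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering, not the authors' released code: a layer whose neurons evaluate a multivariate polynomial over a small fan-in (so the whole function can later be folded into an F-input LUT), plus a group-lasso-style regularizer that pushes all monomial weights touching a given input to zero, shrinking the number of inputs per neuron. Names such as `PolyNeuronLayer` and `input_group_penalty` are illustrative.

```python
import itertools
import torch
import torch.nn as nn

class PolyNeuronLayer(nn.Module):
    """Neurons that evaluate a multivariate polynomial of bounded degree.

    Assumes a small fan-in (in_features), since the monomial count, and the
    eventual LUT size, grows quickly with the number of inputs.
    """
    def __init__(self, in_features, out_features, degree=2):
        super().__init__()
        # All monomial exponent tuples with total degree in [1, degree].
        self.monomials = [
            e for e in itertools.product(range(degree + 1), repeat=in_features)
            if 0 < sum(e) <= degree
        ]
        self.weight = nn.Parameter(0.1 * torch.randn(out_features, len(self.monomials)))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # x: (batch, in_features). Build the monomial features explicitly;
        # in hardware this whole map is absorbed into the LUT contents.
        exps = [torch.tensor(e, dtype=x.dtype, device=x.device) for e in self.monomials]
        feats = torch.stack([torch.prod(x ** e, dim=-1) for e in exps], dim=-1)
        return feats @ self.weight.T + self.bias

def input_group_penalty(layer):
    # Hardware-aware group regularizer (illustrative): monomial weights are
    # grouped by the input variable they touch, so driving a whole group to
    # zero removes that input entirely and reduces the LUT fan-in.
    penalty = layer.weight.new_zeros(())
    for v in range(len(layer.monomials[0])):
        cols = [i for i, e in enumerate(layer.monomials) if e[v] > 0]
        penalty = penalty + layer.weight[:, cols].norm(dim=1).sum()
    return penalty
```

Training would then minimize something like `task_loss + lam * input_group_penalty(layer)`; inputs whose groups collapse to zero are pruned before the layer is enumerated into LUT truth tables.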
Related papers
- TreeLUT: An Efficient Alternative to Deep Neural Networks for Inference Acceleration Using Gradient Boosted Decision Trees [0.6906005491572401]
We present TreeLUT, an open-source tool for implementing gradient boosted decision trees (GBDTs) on FPGAs.
We show the effectiveness of TreeLUT on multiple classification datasets commonly used to evaluate ultra-low area and latency designs.
Our results show that TreeLUT significantly improves hardware utilization, latency, and throughput at competitive accuracy compared to previous works.
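A hedged sketch of why GBDT inference suits LUT-based hardware (this is not TreeLUT's implementation; the tree encoding below is purely illustrative): each tree reduces to a cascade of threshold comparisons and the ensemble output to a small sum, all shallow combinational logic.

```python
# Illustrative tree encoding: ("leaf", value) or (feature, threshold, left, right).
def eval_tree(x, node):
    while node[0] != "leaf":
        feat, thr, left, right = node
        node = left if x[feat] < thr else right
    return node[1]

def eval_gbdt(x, trees):
    # The ensemble score is a sum of (quantized) leaf values; in hardware,
    # the comparators and the adder tree synthesize to a few LUT levels.
    return sum(eval_tree(x, t) for t in trees)
```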
arXiv Detail & Related papers (2025-01-02T19:38:07Z) - PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs [1.730979251211628]
This work introduces PolyLUT-Add, a technique that enhances neuron connectivity by combining $A$ PolyLUT sub-neurons via addition to improve accuracy.
We evaluate our implementation over the MNIST, Jet Substructure classification, and Network Intrusion Detection benchmarks and find that, for similar accuracy, PolyLUT-Add achieves a LUT reduction of $2.0$-$13.9\times$ with a $1.2$-$1.6\times$ decrease in latency.
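A minimal sketch of that idea, under assumed names and structure (not the paper's code): each of the $A$ sub-neurons sees only a narrow input slice that fits in its own LUT, and a cheap adder combines them, emulating a neuron with $A\times$ wider connectivity.

```python
import torch
import torch.nn as nn

class AddOfSubNeurons(nn.Module):
    def __init__(self, sub_neurons):
        super().__init__()
        self.subs = nn.ModuleList(sub_neurons)  # A low-fan-in sub-neurons

    def forward(self, x, fan_in):
        # Split the wide input into narrow slices, one per sub-neuron, and
        # sum their outputs; only the adder sits outside the LUTs.
        slices = torch.split(x, fan_in, dim=-1)
        return sum(sub(s) for sub, s in zip(self.subs, slices))
```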
arXiv Detail & Related papers (2024-06-07T13:00:57Z) - NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions [2.7086888205833968]
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks.
We propose relaxing the boundaries of neurons and mapping entire sub-networks to a single LUT.
We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST.
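The step of folding a trained sub-network into a single LUT can be sketched as truth-table enumeration; the helper below is an assumed illustration, not the NeuraLUT tool itself. Because the sub-network sees only F quantized input bits, all 2^F responses can be tabulated offline, and inference becomes one table lookup.

```python
import itertools
import torch

@torch.no_grad()
def subnet_to_lut(subnet, fan_in_bits):
    table = {}
    for bits in itertools.product((0.0, 1.0), repeat=fan_in_bits):
        x = torch.tensor(bits).unsqueeze(0)          # one input pattern
        table[bits] = subnet(x).round().squeeze(0)   # quantized output word
    return table  # 2 ** fan_in_bits entries, one per LUT address
```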
arXiv Detail & Related papers (2024-02-29T16:10:21Z) - PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA
LUT-based Inference [3.1999570171901786]
We show that by using polynomial building blocks, we can achieve the same accuracy using fewer layers of soft logic than by using linear functions.
We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
arXiv Detail & Related papers (2023-09-05T15:54:09Z) - Regularization of polynomial networks for image recognition [78.4786845859205]
Polynomial Networks (PNs) have emerged as an alternative method with a promising performance and improved interpretability.
We introduce a class of PNs that reach the performance of ResNet across six benchmarks.
arXiv Detail & Related papers (2023-03-24T10:05:22Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
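For orientation, a minimal leaky integrate-and-fire step written against snnTorch's standard API (tensor sizes are arbitrary, and this assumes the usual `snn.Leaky` interface rather than anything IPU-specific):

```python
import torch
import snntorch as snn

lif = snn.Leaky(beta=0.9)          # membrane decay per timestep
mem = lif.init_leaky()             # initialize the membrane potential
for step in range(5):
    cur = torch.rand(1, 10)        # input current at this timestep
    spk, mem = lif(cur, mem)       # spikes fire when mem crosses threshold
```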
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for
5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z) - Poly-NL: Linear Complexity Non-local Layers with Polynomials [76.21832434001759]
We formulate novel fast NonLocal blocks, capable of reducing complexity from quadratic to linear with no loss in performance.
The proposed method, which we dub as "Poly-NL", is competitive with state-of-the-art performance across image recognition, instance segmentation, and face detection tasks.
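The quadratic-to-linear reduction can be illustrated with the standard associativity trick (a generic sketch, not the exact Poly-NL formulation): with N tokens of dimension d, computing $K^\top V$ first replaces the $N \times N$ intermediate with a $d \times d$ one. Without a row-wise softmax in between, the two orderings are mathematically identical.

```python
import torch

def nonlocal_quadratic(q, k, v):
    return (q @ k.transpose(-2, -1)) @ v   # (N, N) intermediate: O(N^2 d)

def nonlocal_linear(q, k, v):
    return q @ (k.transpose(-2, -1) @ v)   # (d, d) intermediate: O(N d^2)
```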
arXiv Detail & Related papers (2021-07-06T19:51:37Z) - NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function
Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace GPU- and CPU-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z) - Fast and Complete: Enabling Complete Neural Network Verification with
Rapid and Massively Parallel Incomplete Verifiers [112.23981192818721]
We propose to use backward-mode linear relaxation based perturbation analysis (LiRPA) to replace Linear Programming (LP) during the branch-and-bound (BaB) process.
Unlike LP, LiRPA applied naively can produce much weaker bounds and cannot even check certain conflicts between sub-domains during splitting.
We demonstrate an order of magnitude speedup compared to existing LP-based approaches.
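The BaB loop being accelerated has a simple skeleton; the sketch below is an assumed outline with placeholder `bound_fn`, `split`, and `width` functions rather than the paper's system. An incomplete verifier bounds each sub-domain, proved regions are discarded, and the rest are split and re-queued.

```python
def branch_and_bound(domain, bound_fn, split, width, eps=1e-4):
    queue = [domain]
    while queue:
        d = queue.pop()
        lb = bound_fn(d)          # cheap incomplete lower bound (e.g. LiRPA)
        if lb >= 0:
            continue              # property proved on this sub-domain
        if width(d) < eps:
            return False          # tiny region still unproved: treat as falsified
        queue.extend(split(d))    # refine and keep searching
    return True                   # proved on the whole input domain
```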
arXiv Detail & Related papers (2020-11-27T16:42:12Z) - Deep Polynomial Neural Networks [77.70761658507507]
$\Pi$Nets are a new class of function approximators based on polynomial expansions.
$\Pi$Nets produce state-of-the-art results in three challenging tasks, i.e., image generation, face verification, and 3D mesh representation learning.
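A minimal sketch, assuming the CCP-style recursion this family of polynomial networks is known for (sizes and names are illustrative): each step takes a Hadamard product of a fresh linear view of the input with the running representation, so n steps realize a degree-n polynomial of the input.

```python
import torch
import torch.nn as nn

class PiNetCCP(nn.Module):
    def __init__(self, dim, degree):
        super().__init__()
        self.maps = nn.ModuleList([nn.Linear(dim, dim) for _ in range(degree)])

    def forward(self, z):
        x = self.maps[0](z)
        for lin in self.maps[1:]:
            x = lin(z) * x + x   # element-wise product raises the degree by one
        return x
```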
arXiv Detail & Related papers (2020-06-20T16:23:32Z)