Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism
- URL: http://arxiv.org/abs/2111.10770v1
- Date: Sun, 21 Nov 2021 08:56:29 GMT
- Title: Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism
- Authors: Ihor Vasyltsov, Wooseok Chang
- Abstract summary: We propose two methods to approximate softmax computation, both based on LookUp Tables (LUTs).
We show that 8-bit approximation keeps the accuracy loss below $1.0\%$.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been a rapid advance of custom hardware (HW) for accelerating the
inference speed of deep neural networks (DNNs). Previously, the softmax layer
was not a main concern of DNN accelerating HW, because its portion is
relatively small in multi-layer perceptron or convolutional neural networks.
However, as the attention mechanisms are widely used in various modern DNNs, a
cost-efficient implementation of softmax layer is becoming very important. In
this paper, we propose two methods to approximate softmax computation, which
are based on the usage of LookUp Tables (LUTs). The required size of LUT is
quite small (about 700 Bytes) because ranges of numerators and denominators of
softmax are stable if normalization is applied to the input. We have validated
the proposed technique over different AI tasks (object detection, machine
translation, sentiment analysis, and semantic equivalence) and DNN models
(DETR, Transformer, BERT) by a variety of benchmarks (COCO17, WMT14, WMT17,
GLUE). We showed that 8-bit approximation keeps the accuracy loss at an
acceptable level, below $1.0\%$.
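
The LUT idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the table layout, the clipping range `X_MIN`, the 8-bit entry scale, and the function names `softmax_lut` and `softmax_exact` are all assumptions made for illustration. The sketch normalizes the input by subtracting its maximum, so every exponent falls in a fixed non-positive range, and then replaces `exp` with a 256-entry table of 8-bit values. The final division is done in floating point here for simplicity.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
LUT_BITS = 8                       # assumed index width
LUT_SIZE = 1 << LUT_BITS           # 256 table entries
X_MIN = -8.0                       # assumed clipping range for normalized inputs
STEP = -X_MIN / (LUT_SIZE - 1)     # input distance covered by one table index
SCALE = 255                        # entries are stored as unsigned 8-bit values

# EXP_LUT[i] approximates exp(-i * STEP), scaled to the range 0..255.
EXP_LUT = [round(math.exp(-i * STEP) * SCALE) for i in range(LUT_SIZE)]


def softmax_exact(xs):
    """Reference softmax with max-subtraction for numerical stability."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def softmax_lut(xs):
    """Softmax whose numerators come from the 8-bit exponent table."""
    m = max(xs)
    nums = []
    for x in xs:
        # After normalization the distance to the maximum is >= 0;
        # clip it to the range covered by the table.
        d = min(m - x, -X_MIN)
        idx = round(d / STEP)      # quantize to an 8-bit table index
        nums.append(EXP_LUT[idx])
    # The maximum element always maps to index 0 (value 255), so denom > 0.
    denom = sum(nums)
    return [n / denom for n in nums]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1, -3.0]
    print(softmax_exact(logits))
    print(softmax_lut(logits))
```

Because the normalized inputs always lie in a fixed range, the table in this sketch needs only 256 one-byte entries regardless of the magnitude of the original logits. That is the property the abstract relies on when it says the numerator and denominator ranges are stable after normalization.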
Related papers
- BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling [12.43240392025487]
We introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax.
Our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss.
Our work paves the way for doubling end-to-end inference throughput without increasing chip area.
arXiv Detail & Related papers (2026-02-02T13:12:18Z) - PiC-BNN: A 128-kbit 65 nm Processing-in-CAM-Based End-to-End Binary Neural Network Accelerator [1.4777718769290524]
We propose PiC-BNN, a true end-to-end binary in-approximate search (Hamming distance tolerant) content addressable memory based BNN accelerator.
PiC-BNN uses Hamming distance tolerance to apply the law of large numbers to enable accurate classification without implementing full precision operations.
arXiv Detail & Related papers (2026-01-08T19:33:57Z) - Compressing Deep Neural Networks Using Explainable AI [0.0]
A novel compression approach using XAI is proposed to efficiently reduce the model size with negligible accuracy loss.
The experimental results show that the proposed compression approach reduces the model size by 64% while the accuracy is improved by 42%.
arXiv Detail & Related papers (2025-07-04T21:45:34Z) - Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation [2.649410674489787]
Spiking Neural Networks (SNNs) are highly energy-efficient during inference.
Their ability to process event-driven inputs, such as data from dynamic vision sensors (DVS), further enhances their applicability to edge computing tasks.
We present the first benchmark for the DVS gesture recognition task using SNNs optimized for the many-core neuromorphic chip SpiNNaker2.
arXiv Detail & Related papers (2025-04-09T10:09:29Z) - Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation [51.235583545740674]
MaxLin is a robustness verifier for MaxPool-based CNNs with tight linear approximation.
We evaluate MaxLin with open-sourced benchmarks, including LeNet and networks trained on the MNIST, CIFAR-10, and Tiny ImageNet datasets.
arXiv Detail & Related papers (2024-06-02T10:33:04Z) - NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions [2.7086888205833968]
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks.
We propose relaxing the boundaries of neurons and mapping entire sub-networks to a single LUT.
We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST.
arXiv Detail & Related papers (2024-02-29T16:10:21Z) - An Automata-Theoretic Approach to Synthesizing Binarized Neural Networks [13.271286153792058]
Quantized neural networks (QNNs) have been developed, with binarized neural networks (BNNs) restricted to binary values as a special case.
This paper presents an automata-theoretic approach to synthesizing BNNs that meet designated properties.
arXiv Detail & Related papers (2023-07-29T06:27:28Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - SpikeMS: Deep Spiking Neural Network for Motion Segmentation [7.491944503744111]
SpikeMS is the first deep encoder-decoder SNN architecture for the real-world large-scale problem of motion segmentation.
We show that SpikeMS is capable of incremental predictions, or predictions from smaller amounts of test data than it is trained on.
arXiv Detail & Related papers (2021-05-13T21:34:55Z) - Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
arXiv Detail & Related papers (2020-12-31T18:48:58Z) - Ax-BxP: Approximate Blocked Computation for Precision-Reconfigurable Deep Neural Network Acceleration [3.7371886886933487]
Precision scaling has emerged as a popular technique to optimize the compute and storage requirements of Deep Neural Networks (DNNs).
Efforts toward creating ultra-low-precision (sub-8-bit) DNNs suggest that the minimum precision required to achieve a given network-level accuracy varies considerably across networks.
Previous proposals such as bit-serial hardware incur high overheads, significantly diminishing the benefits of lower precision.
arXiv Detail & Related papers (2020-11-25T20:00:38Z) - AutoPruning for Deep Neural Network with Dynamic Channel Masking [28.018077874687343]
We propose a learning-based auto-pruning algorithm for deep neural networks.
A two-objective problem that aims for the weights and the best channels for each layer is first formulated.
An alternative optimization approach is then proposed to derive the optimal channel numbers and weights simultaneously.
arXiv Detail & Related papers (2020-10-22T20:12:46Z) - FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z) - Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network.
Our method requires far fewer communication rounds in theory.
Our experiments on several datasets show the effectiveness of our method and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks [52.972605601174955]
We show a ResNet-type CNN can attain the minimax optimal error rates in important function classes.
We derive approximation and estimation error rates of the aforementioned type of CNNs for the Barron and Hölder classes.
arXiv Detail & Related papers (2019-03-24T19:42:39Z)