SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable
Accuracy
- URL: http://arxiv.org/abs/2011.01148v1
- Date: Mon, 2 Nov 2020 17:40:44 GMT
- Title: SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable
Accuracy
- Authors: Zahra Ebrahimi and Salim Ullah and Akash Kumar
- Abstract summary: This paper presents, for the first time, a SIMD architecture based on a novel multiplier and divider with tunable accuracy.
The proposed hybrid architecture implements Mitchell's algorithms and supports precision variability from 8 to 32 bits.
- Score: 3.4154033825543055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ever-increasing quest for data-level parallelism and variable precision
in ubiquitous multimedia and Deep Neural Network (DNN) applications has
motivated the use of Single Instruction, Multiple Data (SIMD) architectures. To
alleviate energy consumption, their main resource constraint, approximate
computing has re-emerged, albeit mainly specialized for Application-Specific
Integrated Circuit (ASIC) implementations. This paper presents, for the first
time, a SIMD architecture based on a novel multiplier and divider with tunable
accuracy, targeted at Field-Programmable Gate Arrays (FPGAs). The proposed hybrid
architecture implements Mitchell's algorithms and supports precision
variability from 8 to 32 bits. Experimental results obtained from Vivado and
from multimedia and DNN applications indicate the superiority of the proposed
architecture (in both SISD and SIMD modes) over accurate and state-of-the-art
approximate counterparts. In particular, the proposed SISD divider outperforms
the accurate Intellectual Property (IP) divider provided by Xilinx, achieving
4x higher speed and 4.6x lower energy while incurring less than 0.8% error.
Moreover, the proposed SIMD multiplier-divider supersedes the accurate SIMD
multiplier, achieving up to 26%, 45%, 36%, and 56% improvements in area,
throughput, power, and energy, respectively.
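
Since the architecture builds on Mitchell's logarithmic algorithms, a minimal software sketch of the classic Mitchell approximate multiply and divide is given below for orientation. It uses floating-point arithmetic for clarity; the paper's actual fixed-point FPGA datapath, SIMD packing, and accuracy-tuning logic are not reproduced here, and the function names are illustrative.

```python
def mitchell_mul(a: int, b: int) -> float:
    """Approximate a * b with Mitchell's logarithmic method.

    For x = 2^k * (1 + m) with 0 <= m < 1, log2(x) is approximated by
    k + m, so log2(a*b) ~ (k1 + m1) + (k2 + m2); the antilog uses the
    same piecewise-linear approximation 2^(p+f) ~ 2^p * (1 + f).
    """
    if a == 0 or b == 0:
        return 0.0
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1  # leading-one positions
    m1 = a / (1 << k1) - 1.0                         # fractional parts in [0, 1)
    m2 = b / (1 << k2) - 1.0
    s = m1 + m2
    if s < 1.0:                                      # no carry into the integer part
        return (1.0 + s) * (1 << (k1 + k2))
    return s * (1 << (k1 + k2 + 1))                  # carry: 2^(k+1) * (1 + (s - 1))

def mitchell_div(a: int, b: int) -> float:
    """Approximate a / b by subtracting Mitchell log estimates."""
    if a == 0:
        return 0.0
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1
    d = (a / (1 << k1)) - (b / (1 << k2))            # m1 - m2, in (-1, 1)
    if d >= 0.0:
        return (1.0 + d) * 2.0 ** (k1 - k2)
    return (2.0 + d) * 2.0 ** (k1 - k2 - 1)          # borrow from the integer part

# 200 * 39 = 7800 exactly; the sketch returns 7296, within Mitchell's
# known worst-case multiplication error of about 11.1%.
print(mitchell_mul(200, 39), mitchell_div(7800, 39))
```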
Related papers
- An Open-Source Framework for Efficient Numerically-Tailored Computations [1.0596516362730137]
We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix multiplications.
For AI inference, we consider a set of state-of-the-art neural network models, namely ResNet18, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11.
Our approach consistently reduces energy consumption across all cases, with a notable example being the reduction by factors of $3.3\times$ for IEEE754-32 and $1.4\times$ for Bfloat16.
arXiv Detail & Related papers (2024-05-29T10:10:53Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but poses challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
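
For orientation, the sketch below shows the generic top-k mixture-of-experts routing such an accelerator must support: only k experts run per token, which is where MoE's computation reduction comes from. It is a minimal NumPy illustration with assumed shapes and ReLU experts, not Edge-MoE's hardware design.

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Minimal top-k mixture-of-experts routing for one token vector x.

    experts: list of (W, b) feed-forward weights; gate_w: gating matrix.
    Only the k highest-scoring experts execute, so most expert weights
    are never touched for a given token.
    """
    logits = x @ gate_w                          # one score per expert
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over the selected experts
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        W, b = experts[i]
        out += w * np.maximum(x @ W + b, 0.0)    # weighted ReLU expert output
    return out

rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(n_exp)]
print(moe_layer(rng.normal(size=d), experts, rng.normal(size=(d, n_exp))))
```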
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
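
The core trick can be sketched in a few lines: with very low-precision operands, every possible product fits in a small precomputed table, so the inner loop needs only indexed loads and adds. The 2-bit quantization levels below are assumed for illustration; DeepGEMM's actual SIMD shuffle-based kernels are not reproduced.

```python
import numpy as np

def lut_dot(a_codes, w_codes, a_levels, w_levels):
    """Dot product of quantized vectors via table lookup instead of multiply.

    a_codes/w_codes hold small integer codes (here 2-bit, values 0..3);
    the table stores every product of dequantized levels, replacing
    multiplies with lookups.
    """
    table = np.outer(a_levels, w_levels)   # all |A| x |W| products, precomputed once
    return table[a_codes, w_codes].sum()

a_levels = np.array([-1.5, -0.5, 0.5, 1.5])    # assumed 2-bit activation levels
w_levels = np.array([-1.0, -0.25, 0.25, 1.0])  # assumed 2-bit weight levels
a = np.array([0, 3, 2, 1, 2])
w = np.array([1, 1, 0, 3, 2])
exact = (a_levels[a] * w_levels[w]).sum()
print(lut_dot(a, w, a_levels, w_levels), exact)  # identical results
```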
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - XPert: Peripheral Circuit & Neural Architecture Co-search for Area and
Energy-efficient Xbar-based Computing [13.499706125321605]
XPert co-searches network architecture and peripheral parameters to achieve optimal performance.
Compared to VGG16 baselines, XPert achieves 10.24x (4.7x) lower EDAP, 1.72x (1.62x) higher TOPS/W, and 1.93x (3x) higher TOPS/mm2 at 92.46% (56.7%) accuracy for CIFAR10 datasets.
arXiv Detail & Related papers (2023-03-30T18:23:20Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d) method, which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Reconfigurable co-processor architecture with limited numerical
precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g., visual systems and robotics.
Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs.
In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations.
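
As a refresher on what Q-format arithmetic looks like, here is a minimal sketch using an assumed Q16.16 split (16 integer, 16 fractional bits); the entry only fixes the total width at 32 bits, so the split and the helper names are illustrative.

```python
def to_q(x: float, frac_bits: int = 16) -> int:
    """Encode a real number as a Q-format fixed-point integer."""
    return round(x * (1 << frac_bits))

def from_q(q: int, frac_bits: int = 16) -> float:
    """Decode a Q-format integer back to a real number."""
    return q / (1 << frac_bits)

def q_mul(a: int, b: int, frac_bits: int = 16) -> int:
    """Multiply two Q-format values: the double-width integer product
    carries 2 * frac_bits fractional bits, so shift the extra ones out."""
    return (a * b) >> frac_bits

a, b = to_q(3.25), to_q(-1.5)
print(from_q(q_mul(a, b)))   # -4.875, exact for these operands
```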
arXiv Detail & Related papers (2021-08-21T09:50:54Z) - Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators [2.6487352458568507]
We propose a mixed-precision convolution unit architecture which supports different integer and floating point (FP) precisions.
We show how to integrate FP computations on integer-based architecture and evaluate overheads incurred by FP arithmetic support.
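
To make "FP computations on an integer-based architecture" concrete, the sketch below multiplies two normal FP32 values using only integer operations: XOR the signs, add the exponents, integer-multiply the 24-bit mantissas, and renormalize. It truncates instead of rounding and ignores special values; it illustrates the mapping generically rather than reproducing the paper's convolution unit.

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE754-32 number into (sign, biased exponent, 24-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    man = (bits & 0x7FFFFF) | (1 << 23)      # restore the implicit leading 1
    return sign, exp, man

def fp_mul_on_int(a: float, b: float) -> float:
    """Multiply two normal FP32 values with integer-only arithmetic."""
    sa, ea, ma = fp32_fields(a)
    sb, eb, mb = fp32_fields(b)
    sign = sa ^ sb                           # signs combine by XOR
    exp = ea + eb - 127                      # exponents add; remove doubled bias
    prod = ma * mb                           # 48-bit integer mantissa product
    if prod >> 47:                           # product in [2, 4): renormalize
        prod >>= 1
        exp += 1
    man = (prod >> 23) & 0x7FFFFF            # drop hidden bit, truncate (no rounding)
    bits = (sign << 31) | (exp << 23) | man
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(fp_mul_on_int(3.5, -2.25), 3.5 * -2.25)   # both -7.875
```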
arXiv Detail & Related papers (2021-01-27T23:57:43Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
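
The sketch below illustrates the kind of structural sparsity the entry refers to: capping the nonzeros per fixed-size block gives every processing element an identical, predictable amount of work. The block size of 4 and budget of 2 nonzeros are assumed for illustration, in the spirit of density-bound / N:M schemes; the paper's systolic-array design is not reproduced.

```python
import numpy as np

def compress_blocks(w, block=4, max_nnz=2):
    """Keep at most max_nnz largest-magnitude entries per block of weights,
    returning per-block (values, local indices) for a compact layout."""
    vals, idxs = [], []
    for start in range(0, len(w), block):
        blk = w[start:start + block]
        keep = np.sort(np.argsort(np.abs(blk))[-max_nnz:])  # largest entries
        vals.append(blk[keep])
        idxs.append(keep)
    return np.array(vals), np.array(idxs)

def sparse_dot(vals, idxs, x, block=4):
    """Dot product touching only stored nonzeros and their small indices."""
    total = 0.0
    for b, (v, i) in enumerate(zip(vals, idxs)):
        total += v @ x[b * block + i]
    return total

w = np.array([0.9, 0.05, -0.1, 0.0, 0.0, 1.2, 0.0, -0.7])
x = np.arange(8.0)
vals, idxs = compress_blocks(w)
print(sparse_dot(vals, idxs, x), w @ x)   # close: pruning drops only the 0.05 weight
```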
arXiv Detail & Related papers (2020-09-04T20:17:42Z) - Binary DAD-Net: Binarized Driveable Area Detection Network for
Autonomous Driving [94.40107679615618]
This paper proposes a novel binarized driveable area detection network (binary DAD-Net).
It uses only binary weights and activations in the encoder, the bottleneck, and the decoder part.
It outperforms state-of-the-art semantic segmentation networks on public datasets.
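
Why binarization pays off in hardware can be shown in one identity: for {-1, +1} vectors packed as bits (+1 -> 1, -1 -> 0), the dot product equals 2 * popcount(XNOR(a, w)) - n, so multiply-accumulate collapses to XNOR and popcount. A minimal sketch, generic to binarized networks rather than DAD-Net's specific layers:

```python
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as integers.

    XNOR marks positions where the signs agree, so
    dot = matches - mismatches = 2 * popcount(XNOR) - n.
    """
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # 1 where signs agree
    return 2 * bin(xnor).count("1") - n

# a = [+1, -1, +1, +1] packed LSB-first -> 0b1101
# w = [+1, +1, -1, +1] packed LSB-first -> 0b1011
print(binary_dot(0b1101, 0b1011, 4))   # 0, matching 1*1 + (-1)*1 + 1*(-1) + 1*1
```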
arXiv Detail & Related papers (2020-06-15T07:09:01Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
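
A small sketch of fine-grained patterns inside coarse structures: each 3x3 kernel keeps the 4-entry pattern from a small library that best preserves its magnitude, so every kernel has the same nonzero count plus a pattern ID the compiler can specialize code for. The pattern library below is hypothetical; PatDNN's actual patterns and compiler passes differ.

```python
import numpy as np

# A tiny hypothetical library of 4-entry patterns over a flattened 3x3 kernel.
PATTERNS = [
    np.array([0, 1, 3, 4]),
    np.array([1, 2, 4, 5]),
    np.array([3, 4, 6, 7]),
    np.array([4, 5, 7, 8]),
]

def pattern_prune(kernel):
    """Zero a 3x3 kernel outside the library pattern that keeps the most
    total magnitude; return the pruned kernel and its pattern ID."""
    flat = kernel.reshape(-1)
    best = int(np.argmax([np.abs(flat[p]).sum() for p in PATTERNS]))
    pruned = np.zeros_like(flat)
    pruned[PATTERNS[best]] = flat[PATTERNS[best]]
    return pruned.reshape(3, 3), best

rng = np.random.default_rng(1)
pruned, pid = pattern_prune(rng.normal(size=(3, 3)))
print(pid, pruned, sep="\n")
```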
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.