Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix
Acceleration in Tensorflow Lite
- URL: http://arxiv.org/abs/2304.08211v1
- Date: Mon, 17 Apr 2023 12:31:50 GMT
- Title: Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix
Acceleration in Tensorflow Lite
- Authors: Jose Nunez-Yanez, Andres Otero, Eduardo de la Torre
- Abstract summary: We present a dynamically reconfigurable hardware accelerator called FADES (Fused Architecture for DEnse and Sparse matrices).
The FADES design offers multiple configuration options that trade off complexity and parallelism using a dataflow model to create four stages that read, compute, scale and write results.
We show that the core can outperform dense mode even at low sparsity levels, and a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a dynamically reconfigurable hardware accelerator
called FADES (Fused Architecture for DEnse and Sparse matrices). The FADES
design offers multiple configuration options that trade off parallelism and
complexity using a dataflow model to create four stages that read, compute,
scale and write results. FADES is mapped to the programmable logic (PL) and
integrated with the TensorFlow Lite inference engine running on the processing
system (PS) of a heterogeneous SoC device. The accelerator is used to compute
the tensor operations, while the dynamically reconfigurable approach can be
used to switch precision between int8 and float modes. This dynamic
reconfiguration enables better performance by allowing more cores to be mapped
to the resource-constrained device and lower power consumption compared with
supporting both arithmetic precisions simultaneously. We compare the proposed
hardware with a high-performance systolic architecture for dense matrices,
obtaining 25% better performance in dense mode while using half the DSP blocks
in the same technology. In sparse mode, we show that the core can outperform
dense mode even at low sparsity levels, and a single core achieves up to 20x
acceleration over the software-optimized NEON RUY library.
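The paper itself provides no code, but the four-stage dataflow is easy to picture in software. Below is a minimal Python/NumPy sketch, assuming a CSR-encoded sparse operand and TFLite-style int8 requantization; the function names, scaling constants, and loop structure are illustrative assumptions, not the FADES hardware.

```python
# Minimal sketch of a four-stage read/compute/scale/write dataflow for
# sparse (CSR) x dense int8 GEMM, loosely mirroring the stage structure
# described in the abstract. Illustrative only: the real design pipelines
# these stages in programmable logic.
import numpy as np

def read_stage(indptr, indices, data, B, row):
    """Fetch one CSR row of A and the matching rows of dense B."""
    lo, hi = indptr[row], indptr[row + 1]
    return data[lo:hi], B[indices[lo:hi], :]

def compute_stage(vals, b_rows):
    """Accumulate the sparse-row x dense-block product in int32."""
    return (vals.astype(np.int32)[:, None] * b_rows.astype(np.int32)).sum(axis=0)

def scale_stage(acc, multiplier, zero_point):
    """Requantize int32 accumulators back to int8 (TFLite-style)."""
    return np.clip(np.round(acc * multiplier) + zero_point, -128, 127).astype(np.int8)

def write_stage(C, row, out):
    C[row, :] = out

def spmm_int8(indptr, indices, data, B, multiplier=0.05, zero_point=0):
    C = np.zeros((len(indptr) - 1, B.shape[1]), dtype=np.int8)
    for row in range(len(indptr) - 1):   # hardware runs the stages concurrently
        vals, b_rows = read_stage(indptr, indices, data, B, row)
        acc = compute_stage(vals, b_rows)
        write_stage(C, row, scale_stage(acc, multiplier, zero_point))
    return C

# Example: 4x4 sparse A (CSR) times a dense B
indptr = np.array([0, 2, 2, 3, 4])
indices = np.array([0, 3, 1, 2])
data = np.array([30, -20, 50, 70], dtype=np.int8)
B = np.ones((4, 4), dtype=np.int8)
print(spmm_int8(indptr, indices, data, B))
```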
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
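As a rough sketch of the pipelining idea (an assumption about the mechanism based on the title, not EPS-MoE's actual scheduler): split the batch into micro-batches so that token dispatch for the next micro-batch overlaps expert compute for the current one. dispatch and expert_ffn below are stand-ins.

```python
# Hedged sketch of expert pipelining: dispatch (communication) for
# micro-batch i+1 is issued while expert FFN compute for micro-batch i
# runs on the main thread. Software illustration only.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def dispatch(tokens):            # stand-in for all-to-all token routing
    return tokens                # (real systems move tokens between devices)

def expert_ffn(tokens, W):       # stand-in for an expert's feed-forward pass
    return np.maximum(tokens @ W, 0.0)

def pipelined_moe(batch, W, n_micro=4):
    micro = np.array_split(batch, n_micro)
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(dispatch, micro[0])
        for i in range(n_micro):
            tokens = pending.result()
            if i + 1 < n_micro:                  # prefetch the next dispatch
                pending = io.submit(dispatch, micro[i + 1])
            out.append(expert_ffn(tokens, W))    # overlaps with that dispatch
    return np.concatenate(out)

batch = np.random.rand(32, 64); W = np.random.rand(64, 64)
print(pipelined_moe(batch, W).shape)
```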
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
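A minimal sketch of the additive-quantization idea: each weight group is reconstructed as the sum of one code vector per codebook. AQLM learns its codebooks and encodes with beam search; the fixed random codebooks and greedy residual fit below are illustrative simplifications.

```python
# Additive (multi-codebook) quantization: a length-g weight group is
# stored as M small indices; decoding sums one code vector per codebook.
import numpy as np

M, K, g = 2, 256, 8                      # codebooks, codes per book, group size
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, g)).astype(np.float32)

def decode_group(codes):
    """Reconstruct one weight group from its M code indices."""
    return sum(codebooks[m, codes[m]] for m in range(M))

def encode_group(w):
    """Greedy encoding: per codebook, pick the code minimizing the residual."""
    codes, residual = [], w.copy()
    for m in range(M):
        errs = ((codebooks[m] - residual) ** 2).sum(axis=1)
        idx = int(np.argmin(errs))
        codes.append(idx)
        residual -= codebooks[m, idx]
    return codes

w = rng.standard_normal(g).astype(np.float32)
codes = encode_group(w)
print(np.linalg.norm(w - decode_group(codes)))   # reconstruction error
```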
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Stochastic Configuration Machines: FPGA Implementation [4.57421617811378]
Stochastic configuration networks (SCNs) are a prime choice in industrial applications due to their merits and feasibility for data modelling.
This paper aims to implement SCM models on a field programmable gate array (FPGA) and introduce binary-coded inputs to improve learning performance.
arXiv Detail & Related papers (2023-10-30T02:04:20Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
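To see why nth-order gradients strain conventional architectures, note that each differentiation pass re-traverses the computation graph. The toy below nests a forward-mode pass n times; INR-Arch instead compiles the expanded graph into a dataflow design. The Dual class and derivative helper are illustrative, not the paper's compiler.

```python
# Tiny forward-mode autodiff: nesting it n times yields the nth
# derivative, at the cost of repeatedly expanding the computation graph.
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__

def derivative(f, n):
    """Return the nth derivative of scalar f by nesting forward-mode passes."""
    if n == 0:
        return f
    def df(x):
        y = f(Dual(x, 1.0))
        return y.eps if isinstance(y, Dual) else 0.0
    return derivative(df, n - 1)

f = lambda x: x * x * x          # f(x) = x^3
print(derivative(f, 2)(2.0))     # f''(x) = 6x -> 12.0
```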
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
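A minimal sketch of this two-step split, assuming a batch-reduce GEMM as the primitive (a representative TPP; the names here are illustrative): the computational core is one small reusable kernel, and the loop nest around it can be re-blocked or reordered without touching that kernel.

```python
# Step 1: a small reusable primitive. Step 2: "declarative" loops around
# it that fix blocking and traversal order. Plain NumPy for illustration;
# the real framework JIT-compiles TPPs to ISA-specific code.
import numpy as np

def tpp_brgemm(A_blocks, B_blocks, C_tile):
    """Primitive: batch-reduce GEMM, C_tile += sum_k A_k @ B_k."""
    for Ak, Bk in zip(A_blocks, B_blocks):
        C_tile += Ak @ Bk
    return C_tile

def matmul_from_tpps(A, B, bm=32, bn=32, bk=32):
    M, K = A.shape; _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # Logical loop nest around the primitive: order/blocking can be tuned
    # independently of the core kernel.
    for i in range(0, M, bm):
        for j in range(0, N, bn):
            a = [A[i:i+bm, k:k+bk] for k in range(0, K, bk)]
            b = [B[k:k+bk, j:j+bn] for k in range(0, K, bk)]
            tpp_brgemm(a, b, C[i:i+bm, j:j+bn])
    return C

A = np.random.rand(64, 96); B = np.random.rand(96, 128)
assert np.allclose(matmul_from_tpps(A, B), A @ B)
```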
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation [15.860204740425791]
We propose Permutation Invariant Transformation (PIT) for dynamic sparsity computation.
PIT transforms micro-tiles into a GPU-efficient dense tile without changing the results.
It can accelerate dynamic sparsity computation by up to 5.9x (average 2.43x) over state-of-the-art compilers.
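The trick rests on permutation invariance: gathering rows in any order commutes with the matrix product, i.e. A[idx] @ B == (A @ B)[idx]. A minimal sketch with illustrative tiling and names:

```python
# Gather scattered nonzero rows into a dense tile, run one dense GEMM,
# and scatter the results back; correctness follows from permutation
# invariance of row selection under matmul.
import numpy as np

def sparse_rows_matmul(A, B, nonzero_rows, tile=4):
    out = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)
    for t in range(0, len(nonzero_rows), tile):
        idx = nonzero_rows[t:t+tile]
        dense_tile = A[idx, :]          # gather scattered rows -> dense tile
        out[idx, :] = dense_tile @ B    # GPU-friendly dense GEMM, then scatter
    return out

A = np.random.rand(16, 8)
B = np.random.rand(8, 5)
rows = np.array([1, 3, 7, 9, 12])       # rows that are actually nonzero
mask = np.zeros(16, bool); mask[rows] = True
A[~mask] = 0.0
assert np.allclose(sparse_rows_matmul(A, B, rows), A @ B)
```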
arXiv Detail & Related papers (2023-01-26T04:50:14Z)
- Real-time Hyper-Dimensional Reconfiguration at the Edge using Hardware Accelerators [12.599871451119538]
HyDRATE can perform real-time reconfiguration at the edge using deep neural nets (DNN) combined with hyperdimensional (HD) computing accelerators.
We describe the algorithm, trained quantized model generation, and simulated performance of a feature extractor free of multiply-accumulates.
We show that reconfigurability in the field is achieved by retraining only the feed-forward HD classifier, without gradient-descent backpropagation.
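A hedged sketch of why HD classifiers retrain without backpropagation: class prototypes are bundled (summed) hypervectors, so adaptation is pure accumulation plus a nearest-prototype lookup. The random-projection encoder and dimensions are assumptions, not HyDRATE's design.

```python
# Hyperdimensional classifier: retraining = accumulate encoded feature
# vectors into class prototypes; inference = nearest prototype by cosine.
import numpy as np

D = 4096                                             # hypervector dimensionality
rng = np.random.default_rng(0)
projection = rng.choice([-1.0, 1.0], size=(D, 64))   # random feature encoder

def encode(features):
    """Map a DNN feature vector to a bipolar hypervector."""
    return np.sign(projection @ features)

class HDClassifier:
    def __init__(self, n_classes):
        self.prototypes = np.zeros((n_classes, D))
    def retrain(self, features, label):    # one-shot update: just bundle
        self.prototypes[label] += encode(features)
    def predict(self, features):           # nearest prototype (cosine)
        hv = encode(features)
        sims = self.prototypes @ hv
        norms = np.linalg.norm(self.prototypes, axis=1) + 1e-9
        return int(np.argmax(sims / norms))

clf = HDClassifier(n_classes=3)
x = rng.standard_normal(64)
clf.retrain(x, label=2)
assert clf.predict(x) == 2
```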
arXiv Detail & Related papers (2022-06-10T14:08:41Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
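A minimal sketch of the slicing idea (the gate heuristic, shapes, and names are illustrative; DS-Net learns its gates): easy inputs use only the first k filters of each layer, hard inputs use them all.

```python
# Dynamic weight slicing: per input, run only a contiguous slice of each
# weight matrix, which keeps memory access dense and hardware-friendly.
import numpy as np

W1 = np.random.randn(256, 64)              # full-width layer weights
W2 = np.random.randn(10, 256)

def gate(x):
    """Toy difficulty estimate -> fraction of channels to keep."""
    return 0.25 if np.linalg.norm(x) < 8.0 else 1.0

def dynamic_forward(x):
    k = int(W1.shape[0] * gate(x))          # filter count chosen per input
    h = np.maximum(W1[:k, :] @ x, 0.0)      # only the first k filters computed
    return W2[:, :k] @ h                    # matching input slice downstream

x_easy = np.random.randn(64) * 0.1
x_hard = np.random.randn(64) * 10.0
print(dynamic_forward(x_easy).shape, dynamic_forward(x_hard).shape)
```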
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
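A sketch of the storage scheme that makes structural sparsity attractive, assuming an illustrative 2:4 pattern (at most N nonzeros per group of M): each group keeps N values plus small local indices, so load balance is predictable and index overhead is low.

```python
# Structured (N:M) sparsity: fixed nonzero budget per group gives
# predictable work per PE; indices only need log2(M) bits each.
import numpy as np

def compress_2_4(w_row):
    """Keep the 2 largest-magnitude weights in each group of 4."""
    vals, idxs = [], []
    for g in range(0, len(w_row), 4):
        group = w_row[g:g+4]
        keep = np.argsort(np.abs(group))[-2:]
        vals.append(group[keep]); idxs.append(keep + g)
    return np.concatenate(vals), np.concatenate(idxs)

def sparse_dot(vals, idxs, x):
    """Dot product over stored nonzeros only: the systolic array's job."""
    return np.dot(vals, x[idxs])

w = np.random.randn(16)
x = np.random.randn(16)
vals, idxs = compress_2_4(w)
print(sparse_dot(vals, idxs, x))   # approximates np.dot(w, x)
```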
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
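As a flavor of the loop transformations such an analysis selects, the sketch below tiles a GEMM loop nest so a block of A stays resident while it is reused across column tiles of B; the tile size and loop order are arbitrary assumptions, not PolyDL's derived schedule.

```python
# Loop tiling for data reuse: the A-tile loaded in the kk loop is reused
# across every jj tile, the kind of reuse a polyhedral model quantifies.
import numpy as np

def tiled_gemm(A, B, T=32):
    M, K = A.shape; _, N = B.shape
    C = np.zeros((M, N))
    for ii in range(0, M, T):
        for kk in range(0, K, T):
            a = A[ii:ii+T, kk:kk+T]        # reused across all j-tiles below
            for jj in range(0, N, T):
                C[ii:ii+T, jj:jj+T] += a @ B[kk:kk+T, jj:jj+T]
    return C

A = np.random.rand(96, 64); B = np.random.rand(64, 80)
assert np.allclose(tiled_gemm(A, B), A @ B)
```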
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.