Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix
Acceleration in Tensorflow Lite
- URL: http://arxiv.org/abs/2304.08211v1
- Date: Mon, 17 Apr 2023 12:31:50 GMT
- Title: Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix
Acceleration in Tensorflow Lite
- Authors: Jose Nunez-Yanez, Andres Otero, Eduardo de la Torre
- Abstract summary: We present a dynamically reconfigurable hardware accelerator called FADES (Fused Architecture for DEnse and Sparse matrices).
The FADES design offers multiple configuration options that trade off complexity and parallelism using a dataflow model to create four stages that read, compute, scale and write results.
We show that the core can outperform dense mode even at low sparsity levels, and a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a dynamically reconfigurable hardware accelerator
called FADES (Fused Architecture for DEnse and Sparse matrices). The FADES
design offers multiple configuration options that trade off parallelism and
complexity using a dataflow model to create four stages that read, compute,
scale and write results. FADES is mapped to the programmable logic (PL) and
integrated with the TensorFlow Lite inference engine running on the processing
system (PS) of a heterogeneous SoC device. The accelerator is used to compute
the tensor operations, while the dynamically reconfigurable approach can be
used to switch precision between int8 and float modes. This dynamic
reconfiguration enables better performance by allowing more cores to be mapped
to the resource-constrained device and lower power consumption compared with
supporting both arithmetic precisions simultaneously. We compare the proposed
hardware with a high-performance systolic architecture for dense matrices,
obtaining 25% better performance in dense mode while using half the DSP blocks
in the same technology. In sparse mode, we show that the core can outperform
dense mode even at low sparsity levels, and a single core achieves up to 20x
acceleration over the software-optimized NEON RUY library.
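The paper itself provides no code, but the four-stage dataflow is easy to picture in software. Below is a minimal Python/NumPy sketch, assuming a CSR-encoded sparse operand and TFLite-style int8 requantization; the function names, scaling constants, and loop structure are illustrative assumptions, not the FADES hardware.

```python
# Minimal sketch of a four-stage read/compute/scale/write dataflow for
# sparse (CSR) x dense int8 GEMM, loosely mirroring the stage structure
# described in the abstract. Illustrative only: the real design pipelines
# these stages in programmable logic.
import numpy as np

def read_stage(indptr, indices, data, B, row):
    """Fetch one CSR row of A and the matching rows of dense B."""
    lo, hi = indptr[row], indptr[row + 1]
    return data[lo:hi], B[indices[lo:hi], :]

def compute_stage(vals, b_rows):
    """Accumulate the sparse-row x dense-block product in int32."""
    return (vals.astype(np.int32)[:, None] * b_rows.astype(np.int32)).sum(axis=0)

def scale_stage(acc, multiplier, zero_point):
    """Requantize int32 accumulators back to int8 (TFLite-style)."""
    return np.clip(np.round(acc * multiplier) + zero_point, -128, 127).astype(np.int8)

def write_stage(C, row, out):
    C[row, :] = out

def spmm_int8(indptr, indices, data, B, multiplier=0.05, zero_point=0):
    C = np.zeros((len(indptr) - 1, B.shape[1]), dtype=np.int8)
    for row in range(len(indptr) - 1):   # hardware runs the stages concurrently
        vals, b_rows = read_stage(indptr, indices, data, B, row)
        acc = compute_stage(vals, b_rows)
        write_stage(C, row, scale_stage(acc, multiplier, zero_point))
    return C

# Example: 4x4 sparse A (CSR) times a dense B
indptr = np.array([0, 2, 2, 3, 4])
indices = np.array([0, 3, 1, 2])
data = np.array([30, -20, 50, 70], dtype=np.int8)
B = np.ones((4, 4), dtype=np.int8)
print(spmm_int8(indptr, indices, data, B))
```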
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
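As a rough sketch of the pipelining idea (an assumption about the mechanism based on the title, not EPS-MoE's actual scheduler): split the batch into micro-batches so that token dispatch for the next micro-batch overlaps expert compute for the current one. dispatch and expert_ffn below are stand-ins.

```python
# Hedged sketch of expert pipelining: dispatch (communication) for
# micro-batch i+1 is issued while expert FFN compute for micro-batch i
# runs on the main thread. Software illustration only.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def dispatch(tokens):            # stand-in for all-to-all token routing
    return tokens                # (real systems move tokens between devices)

def expert_ffn(tokens, W):       # stand-in for an expert's feed-forward pass
    return np.maximum(tokens @ W, 0.0)

def pipelined_moe(batch, W, n_micro=4):
    micro = np.array_split(batch, n_micro)
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(dispatch, micro[0])
        for i in range(n_micro):
            tokens = pending.result()
            if i + 1 < n_micro:                  # prefetch the next dispatch
                pending = io.submit(dispatch, micro[i + 1])
            out.append(expert_ffn(tokens, W))    # overlaps with that dispatch
    return np.concatenate(out)

batch = np.random.rand(32, 64); W = np.random.rand(64, 64)
print(pipelined_moe(batch, W).shape)
```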
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
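A minimal sketch of the additive-quantization idea: each weight group is reconstructed as the sum of one code vector per codebook. AQLM learns its codebooks and encodes with beam search; the fixed random codebooks and greedy residual fit below are illustrative simplifications.

```python
# Additive (multi-codebook) quantization: a length-g weight group is
# stored as M small indices; decoding sums one code vector per codebook.
import numpy as np

M, K, g = 2, 256, 8                      # codebooks, codes per book, group size
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, g)).astype(np.float32)

def decode_group(codes):
    """Reconstruct one weight group from its M code indices."""
    return sum(codebooks[m, codes[m]] for m in range(M))

def encode_group(w):
    """Greedy encoding: per codebook, pick the code minimizing the residual."""
    codes, residual = [], w.copy()
    for m in range(M):
        errs = ((codebooks[m] - residual) ** 2).sum(axis=1)
        idx = int(np.argmin(errs))
        codes.append(idx)
        residual -= codebooks[m, idx]
    return codes

w = rng.standard_normal(g).astype(np.float32)
codes = encode_group(w)
print(np.linalg.norm(w - decode_group(codes)))   # reconstruction error
```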
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Stochastic Configuration Machines: FPGA Implementation [4.57421617811378]
Stochastic configuration networks (SCNs) are a prime choice in industrial applications due to their merits and feasibility for data modelling.
This paper aims to implement SCM models on a field programmable gate array (FPGA) and introduce binary-coded inputs to improve learning performance.
arXiv Detail & Related papers (2023-10-30T02:04:20Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
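To see why nth-order gradients strain conventional architectures, note that each differentiation pass re-traverses the computation graph. The toy below nests a forward-mode pass n times; INR-Arch instead compiles the expanded graph into a dataflow design. The Dual class and derivative helper are illustrative, not the paper's compiler.

```python
# Tiny forward-mode autodiff: nesting it n times yields the nth
# derivative, at the cost of repeatedly expanding the computation graph.
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__

def derivative(f, n):
    """Return the nth derivative of scalar f by nesting forward-mode passes."""
    if n == 0:
        return f
    def df(x):
        y = f(Dual(x, 1.0))
        return y.eps if isinstance(y, Dual) else 0.0
    return derivative(df, n - 1)

f = lambda x: x * x * x          # f(x) = x^3
print(derivative(f, 2)(2.0))     # f''(x) = 6x -> 12.0
```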
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
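A minimal sketch of this two-step split, assuming a batch-reduce GEMM as the primitive (a representative TPP; the names here are illustrative): the computational core is one small reusable kernel, and the loop nest around it can be re-blocked or reordered without touching that kernel.

```python
# Step 1: a small reusable primitive. Step 2: "declarative" loops around
# it that fix blocking and traversal order. Plain NumPy for illustration;
# the real framework JIT-compiles TPPs to ISA-specific code.
import numpy as np

def tpp_brgemm(A_blocks, B_blocks, C_tile):
    """Primitive: batch-reduce GEMM, C_tile += sum_k A_k @ B_k."""
    for Ak, Bk in zip(A_blocks, B_blocks):
        C_tile += Ak @ Bk
    return C_tile

def matmul_from_tpps(A, B, bm=32, bn=32, bk=32):
    M, K = A.shape; _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # Logical loop nest around the primitive: order/blocking can be tuned
    # independently of the core kernel.
    for i in range(0, M, bm):
        for j in range(0, N, bn):
            a = [A[i:i+bm, k:k+bk] for k in range(0, K, bk)]
            b = [B[k:k+bk, j:j+bn] for k in range(0, K, bk)]
            tpp_brgemm(a, b, C[i:i+bm, j:j+bn])
    return C

A = np.random.rand(64, 96); B = np.random.rand(96, 128)
assert np.allclose(matmul_from_tpps(A, B), A @ B)
```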
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation [15.860204740425791]
We propose Permutation Invariant Transformation (PIT) for dynamic sparsity computation.
PIT transforms micro-tiles into a GPU-efficient dense tile without changing the results.
It can accelerate dynamic sparsity computation by up to 5.9x (average 2.43x) over state-of-the-art compilers.
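The trick rests on permutation invariance: gathering rows in any order commutes with the matrix product, i.e. A[idx] @ B == (A @ B)[idx]. A minimal sketch with illustrative tiling and names:

```python
# Gather scattered nonzero rows into a dense tile, run one dense GEMM,
# and scatter the results back; correctness follows from permutation
# invariance of row selection under matmul.
import numpy as np

def sparse_rows_matmul(A, B, nonzero_rows, tile=4):
    out = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)
    for t in range(0, len(nonzero_rows), tile):
        idx = nonzero_rows[t:t+tile]
        dense_tile = A[idx, :]          # gather scattered rows -> dense tile
        out[idx, :] = dense_tile @ B    # GPU-friendly dense GEMM, then scatter
    return out

A = np.random.rand(16, 8)
B = np.random.rand(8, 5)
rows = np.array([1, 3, 7, 9, 12])       # rows that are actually nonzero
mask = np.zeros(16, bool); mask[rows] = True
A[~mask] = 0.0
assert np.allclose(sparse_rows_matmul(A, B, rows), A @ B)
```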
arXiv Detail & Related papers (2023-01-26T04:50:14Z)
- Real-time Hyper-Dimensional Reconfiguration at the Edge using Hardware Accelerators [12.599871451119538]
HyDRATE can perform real-time reconfiguration at the edge using deep neural nets (DNN) combined with hyperdimensional (HD) computing accelerators.
We describe the algorithm, trained quantized model generation, and simulated performance of a feature extractor free of multiply-accumulates.
We show that reconfigurability in the field is achieved by retraining only the feed-forward HD classifier, without gradient-descent backpropagation.
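A hedged sketch of why HD classifiers retrain without backpropagation: class prototypes are bundled (summed) hypervectors, so adaptation is pure accumulation plus a nearest-prototype lookup. The random-projection encoder and dimensions are assumptions, not HyDRATE's design.

```python
# Hyperdimensional classifier: retraining = accumulate encoded feature
# vectors into class prototypes; inference = nearest prototype by cosine.
import numpy as np

D = 4096                                             # hypervector dimensionality
rng = np.random.default_rng(0)
projection = rng.choice([-1.0, 1.0], size=(D, 64))   # random feature encoder

def encode(features):
    """Map a DNN feature vector to a bipolar hypervector."""
    return np.sign(projection @ features)

class HDClassifier:
    def __init__(self, n_classes):
        self.prototypes = np.zeros((n_classes, D))
    def retrain(self, features, label):    # one-shot update: just bundle
        self.prototypes[label] += encode(features)
    def predict(self, features):           # nearest prototype (cosine)
        hv = encode(features)
        sims = self.prototypes @ hv
        norms = np.linalg.norm(self.prototypes, axis=1) + 1e-9
        return int(np.argmax(sims / norms))

clf = HDClassifier(n_classes=3)
x = rng.standard_normal(64)
clf.retrain(x, label=2)
assert clf.predict(x) == 2
```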
arXiv Detail & Related papers (2022-06-10T14:08:41Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
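A minimal sketch of the slicing idea (the gate heuristic, shapes, and names are illustrative; DS-Net learns its gates): easy inputs use only the first k filters of each layer, hard inputs use them all.

```python
# Dynamic weight slicing: per input, run only a contiguous slice of each
# weight matrix, which keeps memory access dense and hardware-friendly.
import numpy as np

W1 = np.random.randn(256, 64)              # full-width layer weights
W2 = np.random.randn(10, 256)

def gate(x):
    """Toy difficulty estimate -> fraction of channels to keep."""
    return 0.25 if np.linalg.norm(x) < 8.0 else 1.0

def dynamic_forward(x):
    k = int(W1.shape[0] * gate(x))          # filter count chosen per input
    h = np.maximum(W1[:k, :] @ x, 0.0)      # only the first k filters computed
    return W2[:, :k] @ h                    # matching input slice downstream

x_easy = np.random.randn(64) * 0.1
x_hard = np.random.randn(64) * 10.0
print(dynamic_forward(x_easy).shape, dynamic_forward(x_hard).shape)
```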
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
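A sketch of the storage scheme that makes structural sparsity attractive, assuming an illustrative 2:4 pattern (at most N nonzeros per group of M): each group keeps N values plus small local indices, so load balance is predictable and index overhead is low.

```python
# Structured (N:M) sparsity: fixed nonzero budget per group gives
# predictable work per PE; indices only need log2(M) bits each.
import numpy as np

def compress_2_4(w_row):
    """Keep the 2 largest-magnitude weights in each group of 4."""
    vals, idxs = [], []
    for g in range(0, len(w_row), 4):
        group = w_row[g:g+4]
        keep = np.argsort(np.abs(group))[-2:]
        vals.append(group[keep]); idxs.append(keep + g)
    return np.concatenate(vals), np.concatenate(idxs)

def sparse_dot(vals, idxs, x):
    """Dot product over stored nonzeros only: the systolic array's job."""
    return np.dot(vals, x[idxs])

w = np.random.randn(16)
x = np.random.randn(16)
vals, idxs = compress_2_4(w)
print(sparse_dot(vals, idxs, x))   # approximates np.dot(w, x)
```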
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
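As a flavor of the loop transformations such an analysis selects, the sketch below tiles a GEMM loop nest so a block of A stays resident while it is reused across column tiles of B; the tile size and loop order are arbitrary assumptions, not PolyDL's derived schedule.

```python
# Loop tiling for data reuse: the A-tile loaded in the kk loop is reused
# across every jj tile, the kind of reuse a polyhedral model quantifies.
import numpy as np

def tiled_gemm(A, B, T=32):
    M, K = A.shape; _, N = B.shape
    C = np.zeros((M, N))
    for ii in range(0, M, T):
        for kk in range(0, K, T):
            a = A[ii:ii+T, kk:kk+T]        # reused across all j-tiles below
            for jj in range(0, N, T):
                C[ii:ii+T, jj:jj+T] += a @ B[kk:kk+T, jj:jj+T]
    return C

A = np.random.rand(96, 64); B = np.random.rand(64, 80)
assert np.allclose(tiled_gemm(A, B), A @ B)
```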
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.