ElasticAI: Creating and Deploying Energy-Efficient Deep Learning Accelerator for Pervasive Computing
- URL: http://arxiv.org/abs/2409.09044v1
- Date: Thu, 29 Aug 2024 12:39:44 GMT
- Title: ElasticAI: Creating and Deploying Energy-Efficient Deep Learning Accelerator for Pervasive Computing
- Authors: Chao Qian, Tianheng Ling, Gregor Schiele
- Abstract summary: Deploying Deep Learning (DL) on embedded devices is a growing trend in pervasive computing.
FPGAs are suitable for deploying DL accelerators on embedded devices, but developing an energy-efficient DL accelerator on an FPGA is not easy.
We propose the ElasticAI-Workflow, which helps DL developers create and deploy DL models as hardware accelerators on embedded FPGAs.
- Score: 19.835810073852244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying Deep Learning (DL) on embedded end devices is a growing trend in pervasive computing. Since most microcontrollers on embedded devices have limited computing power, it is necessary to add a DL accelerator. Embedded Field Programmable Gate Arrays (FPGAs) are suitable for deploying DL accelerators for embedded devices, but developing an energy-efficient DL accelerator on an FPGA is not easy. Therefore, we propose the ElasticAI-Workflow, which aims to help DL developers create and deploy DL models as hardware accelerators on embedded FPGAs. This workflow consists of two key components: the ElasticAI-Creator and the Elastic Node. The former is a toolchain for automatically generating DL accelerators on FPGAs. The latter is a hardware platform for verifying the performance of the generated accelerators. With this combination, the performance of the generated accelerators can be verified on real hardware. We will demonstrate the potential of our approach through a case study.
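To make the division of labor between the two components concrete, here is a minimal Python sketch of what such a workflow could look like from a developer's perspective. Every name in it (QuantizedLinear, ModelSpec, to_vhdl, ElasticNode) is an illustrative assumption for this sketch, not the actual ElasticAI-Creator or Elastic Node API.

```python
# Hypothetical sketch of the Creator/Node split described above. All names
# are illustrative assumptions, not the real ElasticAI-Creator API.
from dataclasses import dataclass, field

@dataclass
class QuantizedLinear:
    """A layer restricted to fixed-point arithmetic so it maps onto FPGA logic."""
    in_features: int
    out_features: int
    total_bits: int = 8  # fixed-point word width (assumed)

@dataclass
class ModelSpec:
    layers: list = field(default_factory=list)

def to_vhdl(model: ModelSpec) -> str:
    """Creator-like step: emit synthesizable HDL for each layer of the model."""
    return "\n".join(
        f"-- entity linear_{i}: {l.in_features}x{l.out_features}, "
        f"{l.total_bits}-bit fixed point"
        for i, l in enumerate(model.layers)
    )

class ElasticNode:
    """Elastic Node-like step: run the accelerator and measure its behavior."""
    def deploy(self, hdl: str) -> None:
        self.hdl = hdl  # in reality: synthesis, place-and-route, flashing

    def measure(self) -> dict:
        # In reality these values would come from on-board power sensors.
        return {"latency_ms": None, "energy_mJ": None}

model = ModelSpec(layers=[QuantizedLinear(64, 32), QuantizedLinear(32, 8)])
node = ElasticNode()
node.deploy(to_vhdl(model))
print(node.measure())
```

The split mirrors the paper's structure: one stage turns a model description into HDL, the other executes and measures the result on real hardware.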
Related papers
- LLM-Aided Compilation for Tensor Accelerators [6.709490736813537]
We discuss how large language models (LLMs) could be leveraged to build a compiler for hardware accelerators.
Specifically, we demonstrate the ability of GPT-4 to achieve high pass rates in translating code to the Gemmini accelerator.
We also propose a 2-phase workflow for utilizing LLMs to generate hardware-optimized code.
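As a rough illustration of how such a two-phase, feedback-driven workflow could be wired up, consider the sketch below; query_llm and compile_and_test are hypothetical stand-ins for an LLM completion API and the Gemmini toolchain, and the prompts are invented for illustration.

```python
# Sketch of a two-phase LLM compilation loop. `query_llm` and
# `compile_and_test` are hypothetical stand-ins, not interfaces from the paper.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM completion API")

def compile_and_test(code: str) -> tuple[bool, str]:
    raise NotImplementedError("stand-in for the accelerator toolchain")

def translate_then_optimize(reference: str, max_attempts: int = 3) -> str | None:
    # Phase 1: functional translation into the accelerator's programming model,
    # with compiler/test errors fed back so the model can repair its output.
    candidate = query_llm(f"Translate this code to Gemmini kernels:\n{reference}")
    for _ in range(max_attempts):
        ok, log = compile_and_test(candidate)
        if ok:
            break
        candidate = query_llm(f"Fix this Gemmini code. Errors:\n{log}\n{candidate}")
    else:
        return None  # never produced a compiling, passing translation
    # Phase 2: ask for hardware-oriented optimization of the working version.
    return query_llm(f"Optimize this Gemmini code (tiling, scratchpad reuse):\n{candidate}")
```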
arXiv Detail & Related papers (2024-08-06T19:10:25Z)
- Designing Efficient LLM Accelerators for Edge Devices [1.4128048241287314]
Large Language Models (LLMs) can be deployed on resource-constrained edge devices to reduce reliance on network connections and provide more privacy, but their computational demands make efficient on-device inference difficult.
To address this issue, designing new and efficient edge accelerators for LLM inference is crucial.
We propose SECDA-LLM, which utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators.
arXiv Detail & Related papers (2024-08-01T11:06:05Z)
- Using the Abstract Computer Architecture Description Language to Model AI Hardware Accelerators [77.89070422157178]
Manufacturers of AI-integrated products face a critical challenge: selecting an accelerator that aligns with their product's performance requirements.
The Abstract Computer Architecture Description Language (ACADL) is a concise formalization of computer architecture block diagrams.
In this paper, we demonstrate how to use the ACADL to model AI hardware accelerators, use their ACADL description to map DNNs onto them, and explain the timing simulation semantics to gather performance results.
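The modeling-then-simulation idea can be pictured with a toy analogue: describe the accelerator as a few typed blocks, then derive a roofline-style latency bound for a DNN layer mapped onto it. This is an illustrative Python sketch, not ACADL syntax or its timing semantics, and all numbers are invented.

```python
# Toy analogue of modeling an accelerator as typed blocks and estimating a
# layer's latency on it. Not ACADL syntax; all numbers are invented.
from dataclasses import dataclass

@dataclass
class ComputeArray:
    macs_per_cycle: int      # multiply-accumulates sustained per cycle

@dataclass
class MemoryPort:
    bytes_per_cycle: int     # sustained bandwidth into the compute block

@dataclass
class Accelerator:
    compute: ComputeArray
    mem: MemoryPort
    clock_hz: float

def layer_cycles(acc: Accelerator, macs: int, traffic_bytes: int) -> float:
    # Roofline-style bound: the layer is limited by compute or data movement.
    return max(macs / acc.compute.macs_per_cycle,
               traffic_bytes / acc.mem.bytes_per_cycle)

npu = Accelerator(ComputeArray(256), MemoryPort(32), clock_hz=500e6)
macs = 3 * 3 * 64 * 64 * 56 * 56                  # 3x3 conv, 64->64 ch, 56x56 map
traffic = (2 * 64 * 56 * 56) + (3 * 3 * 64 * 64)  # acts in/out + weights, 1-byte values
print(f"~{layer_cycles(npu, macs, traffic) / npu.clock_hz * 1e3:.2f} ms")
```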
arXiv Detail & Related papers (2024-01-30T19:27:16Z)
- SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices [48.47320494918925]
This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low-latency applications.
We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion.
We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, as well as off-chip memory buffering to address the limited on-chip memory resources.
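In software terms, a streaming architecture behaves like a chain of stages that consume and emit individual values as they arrive, rather than whole tensors with off-chip round trips between layers. The sketch below illustrates that execution style only; it is not the SATAY toolflow or its hardware components.

```python
# Simplified software picture of a streaming (dataflow) pipeline: each stage
# processes elements as soon as the previous stage emits them.
from typing import Iterator

def conv_stage(stream: Iterator[float], weight: float) -> Iterator[float]:
    for x in stream:          # consume one value as soon as it arrives
        yield x * weight      # stand-in for a MAC-based convolution window

def buffer_stage(stream: Iterator[float], capacity: int = 4) -> Iterator[float]:
    # Stand-in for off-chip buffering when on-chip memory can't hold activations.
    block: list[float] = []
    for x in stream:
        block.append(x)
        if len(block) == capacity:
            yield from block
            block.clear()
    yield from block

def relu_stage(stream: Iterator[float]) -> Iterator[float]:
    for x in stream:
        yield max(0.0, x)

pixels = iter([0.5, -1.0, 2.0, 3.0, -0.5])
pipeline = relu_stage(buffer_stage(conv_stage(pixels, weight=0.8)))
print(list(pipeline))  # [0.4, 0.0, 1.6, 2.4, 0.0]
```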
arXiv Detail & Related papers (2023-09-04T13:15:01Z)
- A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms [9.036774656254375]
This survey summarizes and classifies the most recent advances in designing deep learning accelerators.
It highlights the most advanced approaches to supporting deep learning acceleration, including not only GPU- and TPU-based accelerators but also design-specific hardware accelerators.
The survey also describes accelerators based on emerging memory technologies and computing paradigms.
arXiv Detail & Related papers (2023-06-27T15:24:24Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
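A toy version of that allocation policy: split a fixed pool of devices across pipeline stages in proportion to per-layer compute cost, so the heaviest layers receive the most resources. The function below is purely illustrative and not the paper's scheduler.

```python
# Illustrative greedy partitioning of an NN graph over a pipeline of FPGAs:
# devices are allocated in proportion to per-layer compute cost.
def partition(layer_costs: list[float], num_devices: int) -> list[int]:
    total = sum(layer_costs)
    shares = [max(1, round(c / total * num_devices)) for c in layer_costs]
    # Trim or pad so the allocation sums exactly to the available devices.
    while sum(shares) > num_devices:
        shares[shares.index(max(shares))] -= 1
    while sum(shares) < num_devices:
        shares[shares.index(min(shares))] += 1
    return shares

# Four layers; the second dominates compute, so it receives the most FPGAs.
print(partition([1.0, 6.0, 2.0, 1.0], num_devices=8))  # [1, 4, 2, 1]
```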
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around the TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
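The two-step decomposition can be mimicked in plain NumPy: a couple of 2D block primitives play the role of TPPs, and a blocked operator is just logical loops composed around them. The primitives here are illustrative, not the actual TPP/libxsmm interface.

```python
# NumPy illustration of the TPP idea: a few 2D-tensor primitives act as a
# "virtual ISA", and higher-level operators are loops composing them.
import numpy as np

# --- primitive, TPP-like building blocks (each works on 2D blocks) ---
def tpp_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a @ b                      # 2D block matmul primitive

def tpp_relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)         # 2D unary primitive

# --- a "complex operator" (blocked linear+ReLU) built from the primitives ---
def blocked_linear_relu(x: np.ndarray, w: np.ndarray, block: int = 64) -> np.ndarray:
    m, k = x.shape
    _, n = w.shape
    out = np.zeros((m, n))
    for i in range(0, m, block):          # logical loops around the primitives
        acc = np.zeros((min(block, m - i), n))
        for p in range(0, k, block):
            acc += tpp_gemm(x[i:i+block, p:p+block], w[p:p+block, :])
        out[i:i+block] = tpp_relu(acc)
    return out

x, w = np.random.randn(128, 256), np.random.randn(256, 64)
assert np.allclose(blocked_linear_relu(x, w), np.maximum(x @ w, 0.0))
```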
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads [86.62083829086393]
This work introduces Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of Deep Learning workloads with high productivity.
TPPs define a compact yet versatile set of 2D-tensor operators (a virtual ISA), which can be used as building blocks to construct complex operators on high-dimensional tensors.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.
arXiv Detail & Related papers (2021-04-12T18:35:49Z)
- Neighbors From Hell: Voltage Attacks Against Deep Learning Accelerators on Multi-Tenant FPGAs [13.531406531429335]
We evaluate the security of FPGA-based deep learning accelerators against voltage-based integrity attacks.
We show that aggressive clock gating, an effective power-saving technique, can also be a potential security threat in modern FPGAs.
We achieve 1.18-1.31x higher inference performance by over-clocking the DL accelerator without affecting its prediction accuracy.
arXiv Detail & Related papers (2020-12-14T03:59:08Z)
- Optimizing Memory-Access Patterns for Deep Learning Accelerators [6.931196464448543]
Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost.
Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads.
It is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory.
This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses.
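To see why staging matters, compare off-chip traffic for a naive versus a tiled matrix multiplication under an idealized scratchpad model. The counting below is a back-of-the-envelope sketch, not the paper's polyhedral analysis.

```python
# Idealized scratchpad model for C = A @ B with n x n matrices, 1-word elements.
def dram_words_naive(n: int) -> int:
    # Each row of A stays in the scratchpad, but all of B is re-streamed
    # from off-chip memory for every row of A.
    return n * n + n * (n * n) + n * n          # A once + B per row + C out

def dram_words_tiled(n: int, t: int) -> int:
    # With t x t tiles resident in the scratchpad, each tile-level product
    # loads one A tile and one B tile: (n/t)^3 products of 2*t*t words each.
    blocks = n // t
    return blocks ** 3 * 2 * t * t + n * n      # tile loads + C out

n, t = 1024, 64
print(dram_words_naive(n) / dram_words_tiled(n, t))  # ~31x less traffic when tiled
```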
arXiv Detail & Related papers (2020-02-27T05:06:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.