Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in
the Cloud
- URL: http://arxiv.org/abs/2003.12101v1
- Date: Thu, 26 Mar 2020 18:34:11 GMT
- Title: Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in
the Cloud
- Authors: Shulin Zeng, Guohao Dai, Hanbo Sun, Kai Zhong, Guangjun Ge, Kaiyuan
Guo, Yu Wang, Huazhong Yang
- Abstract summary: FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep neural network (DNN) inference applications.
Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing way for multiple users sharing a single FPGA, and require re-compilation with $\sim$100 s overhead.
- Score: 13.439004162406063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: FPGAs have shown great potential in providing low-latency and
energy-efficient solutions for deep neural network (DNN) inference
applications. Currently, the majority of FPGA-based DNN accelerators in the
cloud run in a time-division multiplexing way for multiple users sharing a
single FPGA, and require re-compilation with $\sim$100 s overhead. Such designs
lead to poor isolation and heavy performance loss for multiple users, and fall
far short of providing efficient and flexible FPGA virtualization in either
public or private cloud scenarios.
To solve these problems, we introduce a novel virtualization framework for
instruction set architecture (ISA)-based DNN accelerators that allows multiple
users to share a single FPGA. We enable isolation by introducing a two-level
instruction dispatch module and a multi-core hardware resource pool. Such designs
provide isolated and runtime-programmable hardware resources, further leading
to performance isolation for multiple users. On the other hand, to overcome the
heavy re-compilation overhead, we propose a tiling-based instruction frame
package design and a two-stage static-dynamic compilation flow. Only the
light-weight runtime information is re-compiled, with $\sim$1 ms overhead, so
performance is guaranteed for the private cloud. Our extensive experimental
results show that the proposed virtualization design achieves 1.07-1.69x and
1.88-3.12x throughput improvement over previous static designs using the
single-core and the multi-core architectures, respectively.
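The two-stage flow is what makes the $\sim$1 ms re-compilation plausible: everything that depends only on the model is compiled once into core-agnostic instruction frame packages, and only the per-request runtime information (which cores a user currently owns) is patched at deployment time. Below is a minimal, hypothetical Python sketch of that split; the names (InstrFramePackage, static_compile, dynamic_compile, dispatch) and the toy ISA strings are illustrative assumptions, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InstrFramePackage:
    # One tiling-based instruction frame package: the instructions for a
    # single tile of a layer, independent of which core will run them.
    layer: str
    tile_id: int
    instructions: List[str]   # toy accelerator-ISA instructions (assumption)
    core_id: int = -1         # runtime information, filled by the dynamic stage

def static_compile(model_layers, tiles_per_layer=4):
    """Heavy stage, run once per model (~100 s in conventional flows):
    emit core-count-agnostic instruction frame packages per tile."""
    packages = []
    for layer in model_layers:
        for t in range(tiles_per_layer):
            packages.append(InstrFramePackage(
                layer=layer,
                tile_id=t,
                instructions=[f"LOAD {layer}.tile{t}",
                              f"CONV {layer}.tile{t}",
                              f"SAVE {layer}.tile{t}"]))
    return packages

def dynamic_compile(packages, allocated_cores):
    """Light-weight stage (~1 ms): patch only the runtime information,
    here the mapping of packages onto the cores granted to this user."""
    for i, pkg in enumerate(packages):
        pkg.core_id = allocated_cores[i % len(allocated_cores)]
    return packages

def dispatch(packages, num_cores):
    """Two-level dispatch sketch: a first-level dispatcher splits the
    instruction stream by core; per-core second-level queues keep each
    user's instructions isolated from the other tenants."""
    queues = {c: [] for c in range(num_cores)}
    for pkg in packages:
        queues[pkg.core_id].append(pkg)
    return queues

pkgs = static_compile(["conv1", "conv2"])             # once, offline
pkgs = dynamic_compile(pkgs, allocated_cores=[0, 2])  # per request, fast
print(dispatch(pkgs, num_cores=4)[0][0].instructions)
```

Keeping the packages core-agnostic is the design choice that keeps the dynamic stage cheap: reallocating a user to different cores only rewrites a core-id field per package instead of re-running the full compiler.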
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE inference.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge in interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
arXiv Detail & Related papers (2022-09-22T18:57:49Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
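For context on the butterfly idea above: a butterfly matrix factorizes a dense weight matrix into $\log_2 n$ sparse stages, each mixing pairs of entries at a fixed stride, so one "linear layer" costs $O(n \log n)$ multiply-adds instead of $O(n^2)$. The NumPy sketch below illustrates the general technique only, not the paper's accelerator mapping; butterfly_apply and the random factors are illustrative assumptions.

```python
import numpy as np

def butterfly_apply(x, factors):
    """Apply a butterfly matrix, stored as log2(n) stages of 2x2 blocks,
    to a vector x of length n = 2**k. Each stage mixes entry pairs at a
    fixed stride, doubling the stride per stage (the FFT access pattern)."""
    n = x.shape[0]
    y = x.copy()
    stride = 1
    for stage in factors:              # stage: array of shape (n // 2, 2, 2)
        out = np.empty_like(y)
        pair = 0
        for start in range(0, n, 2 * stride):
            for i in range(start, start + stride):
                a, b = y[i], y[i + stride]
                w = stage[pair]        # 2x2 block for this entry pair
                out[i] = w[0, 0] * a + w[0, 1] * b
                out[i + stride] = w[1, 0] * a + w[1, 1] * b
                pair += 1
        y = out
        stride *= 2
    return y

n = 8
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n // 2, 2, 2)) for _ in range(int(np.log2(n)))]
x = rng.standard_normal(n)
print(butterfly_apply(x, factors))     # O(n log n) "linear layer"
```

After $\log_2 n$ stages every output depends on every input, which is why a butterfly product can approximate the dense projections in both attention and the FFN while keeping a regular, hardware-friendly sparsity pattern.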
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference [0.0]
We propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized Deep Neural Networks (DNN) inference accelerators on edge devices with FPGAs.
We use SECDA to efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a platform that includes an edge FPGA.
We evaluate the two accelerator designs with four common DNN models, achieving an average performance speedup across models of up to 3.5$\times$ with a 2.9$\times$ reduction in energy consumption over CPU-only inference.
arXiv Detail & Related papers (2021-10-01T15:20:29Z)
- Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing [8.826181951806928]
Systolic-CNN is an OpenCL-defined scalable, run-time-flexible FPGA accelerator architecture.
Systolic-CNN is optimized for accelerating the inference of various convolutional neural networks (CNNs) in multi-tenancy cloud/edge computing.
arXiv Detail & Related papers (2020-12-06T03:53:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.