A Full-stack Accelerator Search Technique for Vision Applications
- URL: http://arxiv.org/abs/2105.12842v1
- Date: Wed, 26 May 2021 21:10:20 GMT
- Title: A Full-stack Accelerator Search Technique for Vision Applications
- Authors: Dan Zhang, Safeen Huda, Ebrahim Songhori, Quoc Le, Anna Goldie, Azalia
Mirhoseini
- Abstract summary: We propose a hardware accelerator search framework that defines a broad optimization environment.
FAST can be used on any number and type of deep learning workloads.
Designs generated by FAST and optimized for single workloads can improve Perf/TDP by over 6x in the best case.
On a limited workload subset, FAST improves Perf/TDP 2.85x on average, with a reduction to 2.35x for a single design optimized over the set of workloads.
- Score: 11.932331630567512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapidly-changing ML model landscape presents a unique opportunity for
building hardware accelerators optimized for specific datacenter-scale
workloads. We propose Full-stack Accelerator Search Technique (FAST), a
hardware accelerator search framework that defines a broad optimization
environment covering key design decisions within the hardware-software stack,
including hardware datapath, software scheduling, and compiler passes such as
operation fusion and tensor padding. Although FAST can be used on any number
and type of deep learning workloads, in this paper we focus on optimizing for a
single or small set of vision models, resulting in significantly faster and
more power-efficient designs relative to a general-purpose ML accelerator. When
EfficientNet, ResNet50v2, and OCR inference performance is evaluated relative
to a TPU-v3, FAST-generated designs optimized for single workloads can
improve Perf/TDP (peak power) by over 6x in the best case and 4x on average. On
a limited workload subset, FAST improves Perf/TDP 2.85x on average, with a
reduction to 2.35x for a single design optimized over the set of workloads. In
addition, we demonstrate a potential 1.8x speedup opportunity for TPU-v3 with
improved scheduling.
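The paper ships no code here; purely as a hedged illustration of the search-loop shape the abstract implies, the Python sketch below samples a made-up hardware/software design space (datapath dimensions, fusion, padding, schedule) and keeps the candidate with the best score under a toy Perf/TDP model. Every knob name and the cost model are invented for illustration, not taken from FAST.
```python
import random

# Hypothetical design space loosely mirroring the knobs FAST searches over:
# hardware datapath shape, operation fusion, tensor padding, and schedule.
DESIGN_SPACE = {
    "pe_rows":      [32, 64, 128, 256],   # systolic array height
    "pe_cols":      [32, 64, 128, 256],   # systolic array width
    "l2_kib":       [512, 1024, 2048],    # on-chip buffer size
    "fuse_convs":   [True, False],        # compiler pass: operation fusion
    "pad_multiple": [1, 8, 32],           # compiler pass: tensor padding
    "schedule":     ["output_stationary", "weight_stationary"],
}

def sample_design():
    return {k: random.choice(v) for k, v in DESIGN_SPACE.items()}

def perf_per_tdp(design):
    """Toy analytical cost model standing in for FAST's real evaluator.
    Returns a made-up Perf/TDP score; a real framework would invoke a
    performance simulator and a power model here."""
    pes = design["pe_rows"] * design["pe_cols"]
    perf = pes * (1.2 if design["fuse_convs"] else 1.0)
    perf *= 1.0 + 0.05 * (design["pad_multiple"] > 1)  # better utilization
    tdp = 10.0 + 0.005 * pes + 0.002 * design["l2_kib"]
    return perf / tdp

best = max((sample_design() for _ in range(10_000)), key=perf_per_tdp)
print("best design:", best, "score:", round(perf_per_tdp(best), 1))
```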
Related papers
- Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
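The summary gives no implementation details; as a generic sketch of one standard DNN-to-SNN porting recipe (not necessarily the authors' method), the code below combines uniform weight quantization with rate-coded integrate-and-fire neurons whose firing rate approximates the source layer's ReLU output. All names and constants are illustrative.
```python
import numpy as np

def quantize(w, bits=4):
    """Uniform weight quantization, one piece of a reduced-precision port."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def if_layer_rates(x, w, threshold=1.0, steps=100):
    """Integrate-and-fire neurons driven for `steps` timesteps; the average
    firing rate approximates the ReLU activation of the source DNN layer."""
    membrane = np.zeros(w.shape[1])
    spikes = np.zeros(w.shape[1])
    drive = x @ w                      # constant input current per step
    for _ in range(steps):
        membrane += drive
        fired = membrane >= threshold
        spikes += fired
        membrane[fired] -= threshold   # soft reset preserves residual charge
    return spikes / steps              # spike rate in [0, 1]

rng = np.random.default_rng(0)
w = quantize(rng.normal(0, 0.1, (16, 8)))
x = rng.random(16)
print("SNN rate:", if_layer_rates(x, w)[:4])
print("ReLU ref:", np.maximum(x @ w, 0)[:4])
```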
arXiv Detail & Related papers (2024-10-07T05:04:13Z)
- FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources [45.40926501138365]
We introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques.
Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead.
We benchmark the performance of FastCLIP and the state-of-the-art training baseline on different compute scales.
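FastCLIP's actual gradient reduction strategy is not described in this summary; the sketch below shows only the generic idea of cutting communication rounds by accumulating micro-batch gradients locally before each all-reduce, simulated with NumPy arrays standing in for workers. All names are invented.
```python
import numpy as np

def train_steps(grads_per_worker, accum_k):
    """Simulate data-parallel workers that locally accumulate `accum_k`
    micro-batch gradients before one all-reduce, so communication rounds
    drop by a factor of accum_k. Generic illustration only."""
    n_workers, n_steps, dim = grads_per_worker.shape
    comm_rounds = 0
    reduced = []
    for start in range(0, n_steps, accum_k):
        # each worker sums its own micro-batch gradients locally (no comms)
        local = grads_per_worker[:, start:start + accum_k].sum(axis=1)
        # one all-reduce (here: a mean over workers) replaces accum_k of them
        reduced.append(local.mean(axis=0))
        comm_rounds += 1
    return np.stack(reduced), comm_rounds

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 16, 4))            # 8 workers, 16 steps, dim-4 grads
_, rounds = train_steps(g, accum_k=4)
print("all-reduce rounds:", rounds, "vs baseline:", g.shape[1])
```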
arXiv Detail & Related papers (2024-07-01T16:37:18Z)
- Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform [13.326025546527784]
We present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform.
For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version.
For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode.
arXiv Detail & Related papers (2024-05-29T17:16:59Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy with an over 80% reduction in computation, but poses challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
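As background for why MoE cuts computation yet complicates a static FPGA dataflow, here is a minimal top-k routing sketch: each input is processed by only its k best-scoring experts. This is generic MoE routing, not Edge-MoE's architecture; all names are illustrative.
```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Top-k mixture-of-experts routing: each token runs through only its
    k highest-scoring experts, which is what makes MoE cheap but also
    makes the memory/control flow irregular on hardware."""
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]    # chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        weights = np.exp(scores) / np.exp(scores).sum()  # softmax over top-k
        for w_, e in zip(weights, topk[t]):
            out[t] += w_ * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda v, m=m: np.maximum(v @ m, 0) for m in mats]
x = rng.normal(size=(5, dim))
print(moe_forward(x, rng.normal(size=(dim, n_experts)), experts).shape)
```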
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy MOT on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models into MESS networks:
specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
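As a rough sketch of the early-exit mechanism described above (with a plain confidence-threshold exit policy standing in for the paper's co-optimised one, and matrix multiplies standing in for conv stages), consider:
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def early_exit_inference(x, stages, exit_heads, threshold=0.9):
    """Run backbone stages in order; after each stage a lightweight exit
    head produces per-pixel class probabilities, and inference stops as
    soon as mean confidence clears the threshold, so easy inputs are
    cheap and hard inputs fall through to the final head."""
    feats = x
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        feats = np.maximum(feats @ stage, 0)      # toy stand-in for a conv stage
        probs = softmax(feats @ head)             # (pixels, classes)
        if probs.max(axis=1).mean() >= threshold:
            return probs, i                       # exited early
    return probs, len(stages) - 1

rng = np.random.default_rng(0)
dim, classes, pixels = 16, 3, 64
stages = [rng.normal(size=(dim, dim)) * 0.5 for _ in range(4)]
heads = [rng.normal(size=(dim, classes)) for _ in range(4)]
probs, exit_idx = early_exit_inference(rng.normal(size=(pixels, dim)), stages, heads)
print("exited at stage", exit_idx, "probs shape", probs.shape)
```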
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often carry a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Searching for Fast Model Families on Datacenter Accelerators [33.28421782921072]
We search for fast and accurate CNN model families for efficient inference on DC accelerators.
We propose a latency-aware compound scaling (LACS) method that optimizes both accuracy and latency.
Our LACS discovers that network depth should grow much faster than image size and network width.
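A compound-scaling sketch can make that finding concrete: a single coefficient phi scales depth, width, and resolution by fixed exponents, and the paper's conclusion corresponds to a depth exponent well above the other two. The exponent values below are invented for illustration, not taken from LACS.
```python
# EfficientNet-style compound scaling: one coefficient phi grows depth,
# width, and input resolution together. LACS searches the exponents with
# latency in the objective; these numbers only illustrate the qualitative
# finding that depth should grow fastest on datacenter accelerators.
ALPHA, BETA, GAMMA = 1.4, 1.1, 1.05   # depth grows much faster than width/size

def scale(base_depth, base_width, base_res, phi):
    return (round(base_depth * ALPHA ** phi),
            round(base_width * BETA ** phi),
            round(base_res * GAMMA ** phi))

for phi in range(4):
    d, w, r = scale(base_depth=18, base_width=64, base_res=224, phi=phi)
    print(f"phi={phi}: depth={d} width={w} resolution={r}")
```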
arXiv Detail & Related papers (2021-02-10T18:15:40Z)
- NPAS: A Compiler-aware Framework of Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration [48.25487285358816]
We propose a compiler automatic code generation framework supporting different DNNs and different pruning schemes.
We also propose NPAS, a compiler-aware unified network pruning and architecture search framework.
Our framework achieves 6.7ms, 5.9ms, 3.9ms ImageNet inference times with 78.2%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy respectively on an off-the-shelf mobile phone.
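The summary omits the details of the pruning schemes; as one hedged example of a compiler-friendly pattern (not NPAS itself), the sketch below prunes weights in whole blocks, the kind of regular sparsity that generated code can exploit by skipping entire blocks.
```python
import numpy as np

def block_prune(w, block=4, keep_ratio=0.5):
    """Prune weights in contiguous blocks rather than individually: blocks
    with the smallest total magnitude are zeroed, leaving regular sparsity
    that a compiler can turn into skipped work. Scheme and numbers are
    illustrative only."""
    rows, cols = w.shape
    assert cols % block == 0
    blocks = w.reshape(rows, cols // block, block)
    scores = np.abs(blocks).sum(axis=2)           # one score per block
    cutoff = np.quantile(scores, 1 - keep_ratio)
    mask = (scores >= cutoff)[..., None]          # keep strongest blocks
    return (blocks * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
pruned = block_prune(w)
print("sparsity:", float((pruned == 0).mean()))
```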
arXiv Detail & Related papers (2020-12-01T16:03:40Z)
- Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction [0.0]
We propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware.
We are able to obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
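Two small pieces of this pipeline are easy to make concrete: the MAPE metric quoted above and the variant-selection step that runs only the predicted-fastest kernel variant. The runtimes below are invented for illustration.
```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, the accuracy metric quoted above."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def select_variant(predicted_runtimes):
    """Variant selection: run only the variant the model predicts fastest."""
    return int(np.argmin(predicted_runtimes))

true_ms = np.array([12.0, 9.5, 15.2])      # measured runtimes (illustrative)
pred_ms = np.array([12.4, 9.3, 14.8])      # model predictions
print(f"MAPE: {mape(true_ms, pred_ms):.1f}%")
print("chosen variant:", select_variant(pred_ms))
```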
arXiv Detail & Related papers (2020-03-17T02:19:54Z)