Related papers: Optimizing DNN Inference on Multi-Accelerator SoCs at Training-time

URL: http://arxiv.org/abs/2409.18566v1
Date: Fri, 27 Sep 2024 09:10:44 GMT
Title: Optimizing DNN Inference on Multi-Accelerator SoCs at Training-time
Authors: Matteo Risso, Alessio Burrello, Daniele Jahier Pagliari
Abstract summary: We present ODiMO, a hardware-aware tool that efficiently explores fine-grain mapping of Deep Neural Networks (DNNs) among various on-chip computing units (CUs). We show that ODiMO reduces the latency of a DNN executed on the Darkside SoC by up to 8x at iso-accuracy, compared to manual mappings. When targeting energy, ODiMO produced up to 50.8x more efficient mappings, with minimal accuracy drop.
Score: 5.05866540830123
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The demand for executing Deep Neural Networks (DNNs) with low latency and minimal power consumption at the edge has led to the development of advanced heterogeneous Systems-on-Chips (SoCs) that incorporate multiple specialized computing units (CUs), such as accelerators. Offloading DNN computations to a specific CU from the available set often exposes accuracy vs efficiency trade-offs, due to differences in their supported operations (e.g., standard vs. depthwise convolution) or data representations (e.g., more/less aggressively quantized). A challenging yet unresolved issue is how to map a DNN onto these multi-CU systems to maximally exploit the parallelization possibilities while taking accuracy into account. To address this problem, we present ODiMO, a hardware-aware tool that efficiently explores fine-grain mapping of DNNs among various on-chip CUs, during the training phase. ODiMO strategically splits individual layers of the neural network and executes them in parallel on the multiple available CUs, aiming to balance the total inference energy consumption or latency with the resulting accuracy, impacted by the unique features of the different hardware units. We test our approach on CIFAR-10, CIFAR-100, and ImageNet, targeting two open-source heterogeneous SoCs, i.e., DIANA and Darkside. We obtain a rich collection of Pareto-optimal networks in the accuracy vs. energy or latency space. We show that ODiMO reduces the latency of a DNN executed on the Darkside SoC by up to 8x at iso-accuracy, compared to manual heuristic mappings. When targeting energy, on the same SoC, ODiMO produced up to 50.8x more efficient mappings, with minimal accuracy drop (< 0.3%).
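The abstract's idea of splitting a single layer across CUs that run in parallel can be illustrated with a toy cost model. The sketch below is not ODiMO's method (which works during training and also accounts for accuracy); it is a brute-force illustration under the assumption that each CU has a fixed, hypothetical cost per output channel and that the layer's latency is set by the slower of the two CUs. The function names and cost values are invented for illustration.

```python
def parallel_latency(channels_on_a, total_channels, cost_a, cost_b):
    """Latency of one layer whose output channels are split across two CUs.

    Assumption: both CUs run in parallel, so the layer finishes when the
    slower one does. cost_a and cost_b are hypothetical per-channel costs.
    """
    channels_on_b = total_channels - channels_on_a
    return max(channels_on_a * cost_a, channels_on_b * cost_b)


def best_split(total_channels, cost_a, cost_b):
    """Exhaustively find how many channels to place on CU A to minimize latency."""
    return min(
        range(total_channels + 1),
        key=lambda n: parallel_latency(n, total_channels, cost_a, cost_b),
    )


# Example: 64 output channels, CU B is 3x slower per channel than CU A.
n_a = best_split(64, cost_a=1, cost_b=3)
print(n_a, parallel_latency(n_a, 64, 1, 3))    # balanced split
print(parallel_latency(64, 64, 1, 3))          # everything on CU A
```

In this toy setting the balanced split beats mapping the whole layer to the faster CU, which is the kind of gain fine-grain mapping aims for. Accuracy effects of each CU's data representation are deliberately left out.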

Related papers

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs). We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort. DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives. For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators [0.20971479389679332]
Energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors. We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings. CNNs utilizing quantized weights and activations and suitable mappings can significantly improve trade-offs among the accuracy, energy, and memory requirements.
arXiv Detail & Related papers (2024-04-08T10:10:30Z)
Hardware-Aware DNN Compression via Diverse Pruning and Mixed-Precision Quantization [1.0235078178220354]
We propose an automated framework to compress Deep Neural Networks (DNNs) in a hardware-aware manner by jointly employing pruning and quantization. Our framework achieves 39% average energy reduction at 1.7% average accuracy loss across datasets and significantly outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2023-12-23T18:50:13Z)
Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks. It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping. It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips [0.32634122554914]
HaX-CoNN is a novel scheme that characterizes and maps layers in concurrently executing inference workloads. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs.
arXiv Detail & Related papers (2023-08-10T22:47:40Z)
DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms [5.970091958678456]
We develop an approach for compiler-level partitioning of deep neural networks (DNNs) onto multiple interconnected hardware devices. Our scheduler integrates both an exact solver, through a mixed integer linear programming (MILP) formulation, and a modularity-based runtime. We show how we can extend our framework to schedule large language models across multiple heterogeneous servers.
arXiv Detail & Related papers (2023-07-31T19:46:49Z)
Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference [22.9834921448069]
We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual mappings.
arXiv Detail & Related papers (2023-06-08T09:23:46Z)
Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency. We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks [29.46621102184345]
We propose a framework dubbed DepthShrinker to develop hardware-friendly compact networks. Our framework delivers hardware-friendly compact networks that outperform both state-of-the-art efficient DNNs and compression techniques.
arXiv Detail & Related papers (2022-06-02T02:32:47Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially on Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations. Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation. Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.