Shared Memory-contention-aware Concurrent DNN Execution for Diversely
Heterogeneous System-on-Chips
- URL: http://arxiv.org/abs/2308.05869v2
- Date: Wed, 7 Feb 2024 01:50:14 GMT
- Title: Shared Memory-contention-aware Concurrent DNN Execution for Diversely
Heterogeneous System-on-Chips
- Authors: Ismet Dagli, Mehmet Belviranli
- Abstract summary: HaX-CoNN is a novel scheme that characterizes and maps layers in concurrently executing inference workloads.
We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SOCs.
- Score: 0.32634122554914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two distinguishing features of state-of-the-art mobile and autonomous systems
are 1) there are often multiple workloads, mainly deep neural network (DNN)
inference, running concurrently and continuously; and 2) they operate on shared
memory system-on-chips (SoC) that embed heterogeneous accelerators tailored for
specific operations. State-of-the-art lacks efficient performance and resource
management techniques necessary to either maximize total system throughput or
minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a
novel scheme that characterizes and maps layers in concurrently executing DNN
inference workloads to a diverse set of accelerators within a SoC. Our scheme
uniquely takes per-layer execution characteristics, shared memory (SM)
contention, and inter-accelerator transitions into account to find optimal
schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm
Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN minimizes
memory contention by up to 45% and can improve latency and total throughput by
up to 32% and 29%, respectively, compared to the state-of-the-art approaches.
Related papers
- Neuromorphic Wireless Split Computing with Multi-Level Spikes [69.73249913506042]
In neuromorphic computing, spiking neural networks (SNNs) perform inference tasks, offering significant efficiency gains for workloads involving sequential data.
Recent advances in hardware and software have demonstrated that embedding a few bits of payload in each spike exchanged between the spiking neurons can further enhance inference accuracy.
This paper investigates a wireless neuromorphic split computing architecture employing multi-level SNNs.
arXiv Detail & Related papers (2024-11-07T14:08:35Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Optimizing DNN Inference on Multi-Accelerator SoCs at Training-time [5.05866540830123]
We present ODiMO, a hardware-aware tool that efficiently explores fine-grain mapping of Deep Neural Networks (DNNs) among various on-chip CUs.
We show that ODiMO reduces the latency of a DNN executed on the Darkside by up to 8x at iso-accuracy, compared to a manual mappings.
When targeting energy, ODiMO produced up to 50.8x more efficient mappings, with minimal accuracy drop.
arXiv Detail & Related papers (2024-09-27T09:10:44Z) - Dynamic DNNs and Runtime Management for Efficient Inference on
Mobile/Embedded Devices [2.8851756275902476]
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms.
We co-designed novel Dynamic Super-Networks to maximise system-level performance and energy efficiency.
Compared with SOTA, our experimental results using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency.
arXiv Detail & Related papers (2024-01-17T04:40:30Z) - Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse
Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z) - An efficient and flexible inference system for serving heterogeneous
ensembles of deep neural networks [0.0]
Ensembles of Deep Neural Networks (DNNs) have achieved qualitative predictions but they are computing and memory intensive.
We propose a new software layer to serve with flexibility and efficiency ensembles of DNNs.
arXiv Detail & Related papers (2022-08-30T08:05:43Z) - Sub-bit Neural Networks: Learning to Compress and Accelerate Binary
Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and the hardware deployment on FPGA validate the great potentials of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z) - Towards Memory-Efficient Neural Networks via Multi-Level in situ
Generation [10.563649948220371]
Deep neural networks (DNN) have shown superior performance in a variety of tasks.
As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices.
We propose a general and unified framework to trade expensive memory transactions with ultra-fast on-chip computations.
arXiv Detail & Related papers (2021-08-25T18:50:24Z) - Deep Learning-based Resource Allocation For Device-to-Device
Communication [66.74874646973593]
We propose a framework for the optimization of the resource allocation in multi-channel cellular systems with device-to-device (D2D) communication.
A deep learning (DL) framework is proposed, where the optimal resource allocation strategy for arbitrary channel conditions is approximated by deep neural network (DNN) models.
Our simulation results confirm that near-optimal performance can be attained with low time, which underlines the real-time capability of the proposed scheme.
arXiv Detail & Related papers (2020-11-25T14:19:23Z) - TASO: Time and Space Optimization for Memory-Constrained DNN Inference [5.023660118588569]
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices.
We propose an approach for ahead-of-time domain specific optimization of CNN models, based on an integer linear programming (ILP) for selecting primitive operations to implement convolutional layers.
arXiv Detail & Related papers (2020-05-21T15:08:06Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.