An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
- URL: http://arxiv.org/abs/2208.14049v1
- Date: Tue, 30 Aug 2022 08:05:43 GMT
- Title: An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
- Authors: Pierrick Pochelu, Serge G. Petiton, Bruno Conche
- Abstract summary: Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive.
We propose a new software layer to serve ensembles of DNNs with flexibility and efficiency.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive, so the demand is growing to make them answer a heavy workload of requests using the available computational resources. Unlike recent initiatives on inference servers and inference frameworks, which focus on predictions from single DNNs, we propose a new software layer to serve ensembles of DNNs with flexibility and efficiency.
Our inference system is designed with several technical innovations. First, we propose a novel procedure to find a good allocation matrix between devices (CPUs or GPUs) and DNN instances: it successively runs a worst-fit algorithm to allocate DNNs into device memory and a greedy algorithm to optimize the allocation settings and speed up the ensemble (a sketch of this procedure follows below). Second, we design the inference system as multiple asynchronous processes for batching, prediction, and the combination rule, with an efficient internal communication scheme to avoid overhead (a sketch of this pipeline is given after the abstract).
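The abstract gives no pseudocode for the two-phase allocation procedure, so below is a minimal Python sketch of the general idea under stated assumptions: a worst-fit pass places each DNN on the device with the most free memory, then a greedy pass adds extra instances while an assumed throughput estimator keeps improving. Function names, the largest-first ordering, and the memory model are illustrative, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of worst-fit placement followed by
# greedy refinement of an ensemble allocation.

def worst_fit_allocate(dnn_mem, device_mem):
    """dnn_mem: {dnn: memory footprint}, device_mem: {device: capacity}.
    Places each DNN on the device with the most free memory (worst-fit);
    the largest-first ordering is a common variant, assumed here."""
    free = dict(device_mem)
    alloc = {dev: [] for dev in device_mem}       # device -> DNN instances
    for dnn, mem in sorted(dnn_mem.items(), key=lambda kv: -kv[1]):
        dev = max(free, key=free.get)             # device with most free memory
        if free[dev] < mem:
            raise MemoryError(f"{dnn} does not fit on any device")
        alloc[dev].append(dnn)
        free[dev] -= mem
    return alloc, free

def greedy_refine(alloc, free, dnn_mem, throughput):
    """Greedily add DNN replicas while the estimated ensemble throughput
    (an assumed callable `throughput(alloc)`) keeps improving."""
    best = throughput(alloc)
    improved = True
    while improved:
        improved = False
        for dev in alloc:
            for dnn, mem in dnn_mem.items():
                if free[dev] < mem:
                    continue
                alloc[dev].append(dnn)            # tentative extra instance
                gain = throughput(alloc)
                if gain > best:
                    best, free[dev] = gain, free[dev] - mem
                    improved = True
                else:
                    alloc[dev].pop()              # revert
    return alloc

# Illustrative usage: footprints and capacities in GB (made-up numbers).
devices = {"gpu0": 16, "gpu1": 16, "cpu0": 64}
models = {"resnet": 4, "inception": 6, "vit": 8}
alloc, free = worst_fit_allocate(models, devices)
```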
Experiments show the flexibility and efficiency of the system under extreme scenarios: it succeeds in serving an ensemble of 12 heavy DNNs on 4 GPUs and, at the other extreme, a single DNN multi-threaded across 16 GPUs. It also outperforms the simple baseline of optimizing the batch size of the DNNs, with a speedup of up to 2.7X on an image classification task.
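The second innovation above, running batching, prediction, and the combination rule as asynchronous processes, can be pictured with a minimal multiprocessing sketch. The queue-based communication, ensemble size, and averaging combination rule below are assumptions for illustration; the paper's internal communication scheme is precisely designed to avoid the overhead that plain queues would add.

```python
# Minimal sketch (assumed design, not the authors' implementation): batching,
# prediction, and combination stages run as separate asynchronous processes.
import multiprocessing as mp

N_MODELS = 3          # ensemble size (illustrative)
BATCH_SIZE = 4

def batcher(requests, model_inputs):
    """Group raw requests into batches and fan each batch out to every model.
    A trailing partial batch is dropped for brevity."""
    batch, batch_id = [], 0
    while True:
        item = requests.get()
        if item is None:                       # shutdown signal
            for q in model_inputs:
                q.put(None)
            return
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            for q in model_inputs:             # same batch to each DNN
                q.put((batch_id, list(batch)))
            batch, batch_id = [], batch_id + 1

def predictor(model_id, inputs, outputs):
    """One process per DNN instance; the forward pass is faked here."""
    while True:
        msg = inputs.get()
        if msg is None:
            outputs.put((model_id, None, None))
            return
        batch_id, batch = msg
        preds = [x + model_id for x in batch]  # stand-in for model(batch)
        outputs.put((model_id, batch_id, preds))

def combiner(outputs):
    """Combination rule (here: averaging) applied once every model answered."""
    pending, finished = {}, 0
    while finished < N_MODELS:
        model_id, batch_id, preds = outputs.get()
        if batch_id is None:
            finished += 1
            continue
        pending.setdefault(batch_id, []).append(preds)
        if len(pending[batch_id]) == N_MODELS:
            members = pending.pop(batch_id)
            avg = [sum(col) / N_MODELS for col in zip(*members)]
            print(f"batch {batch_id}: {avg}")

if __name__ == "__main__":
    requests, outputs = mp.Queue(), mp.Queue()
    model_inputs = [mp.Queue() for _ in range(N_MODELS)]
    procs = [mp.Process(target=batcher, args=(requests, model_inputs)),
             mp.Process(target=combiner, args=(outputs,))]
    procs += [mp.Process(target=predictor, args=(i, model_inputs[i], outputs))
              for i in range(N_MODELS)]
    for p in procs:
        p.start()
    for x in range(2 * BATCH_SIZE):            # eight dummy requests
        requests.put(float(x))
    requests.put(None)
    for p in procs:
        p.join()
```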
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Edge AI as a Service with Coordinated Deep Neural Networks [0.24578723416255746]
CoDE aims to find the optimal path, which is the path with the highest possible reward, by creating multi-task DNNs from individual models.
Experiments show that CoDE enhances the inference throughput and achieves higher precision than a state-of-the-art existing method.
arXiv Detail & Related papers (2024-01-01T01:54:53Z)
- DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms [5.970091958678456]
We develop an approach for compiler-level partitioning of deep neural networks (DNNs) onto multiple interconnected hardware devices.
Our scheduler integrates both an exact solver, through a mixed integer linear programming (MILP) formulation, and a modularity-based heuristic for scalability.
We show how we can extend our framework to schedule large language models across multiple heterogeneous servers.
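As a toy illustration of the kind of device-assignment MILP the summary mentions, here is a sketch using the PuLP solver; the modules, runtimes, and bottleneck-load objective are invented for illustration and are not DiviML's actual formulation.

```python
# Toy MILP sketch (not DiviML's formulation): map DNN modules onto
# heterogeneous devices, minimizing the most-loaded device's runtime.
import pulp

modules = ["stem", "block1", "block2", "head"]
devices = ["cpu", "gpu"]
runtime = {("stem", "cpu"): 8,    ("stem", "gpu"): 2,   # made-up timings (ms)
           ("block1", "cpu"): 20, ("block1", "gpu"): 3,
           ("block2", "cpu"): 20, ("block2", "gpu"): 3,
           ("head", "cpu"): 5,    ("head", "gpu"): 1}

prob = pulp.LpProblem("device_mapping", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (modules, devices), cat="Binary")
makespan = pulp.LpVariable("makespan", lowBound=0)

for m in modules:                  # each module runs on exactly one device
    prob += pulp.lpSum(x[m][d] for d in devices) == 1
for d in devices:                  # every device's load bounds the makespan
    prob += pulp.lpSum(runtime[m, d] * x[m][d] for m in modules) <= makespan
prob += makespan                   # objective: minimize the bottleneck load

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for m in modules:
    print(m, "->", next(d for d in devices if x[m][d].value() > 0.5))
```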
arXiv Detail & Related papers (2023-07-31T19:46:49Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
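The summary's split-point selection can be made concrete with a small sketch: given per-layer edge compute, remaining cloud compute, and the activation size at each candidate split, pick the split that minimizes total latency at the current data rate. All numbers are illustrative assumptions, not from the paper.

```python
# Minimal sketch (illustrative numbers): dynamic selection of a DNN split
# point as a function of the current channel data rate.
edge_ms  = [5, 25, 60, 110, 170]        # cumulative edge compute up to split i
cloud_ms = [20, 17, 12, 6, 0]           # remaining cloud compute after split i
out_bits = [8e6, 4e6, 2e6, 1e6, 5e5]    # activation size sent at split i

def best_split(rate_bps):
    """Return the split index with minimal end-to-end latency (ms)."""
    def latency(i):
        return edge_ms[i] + out_bits[i] / rate_bps * 1e3 + cloud_ms[i]
    return min(range(len(edge_ms)), key=latency)

print(best_split(1e6))    # slow channel -> split late, send less data
print(best_split(100e6))  # fast channel -> split early, offload compute
```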
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- Efficient and Robust Mixed-Integer Optimization Methods for Training Binarized Deep Neural Networks [0.07614628596146598]
We study deep neural networks with binary activation functions and continuous or integer weights (BDNN).
We show that the BDNN can be reformulated as a mixed-integer linear program with bounded weight space which can be solved to global optimality by classical mixed-integer programming solvers.
For the first time, a robust model is presented that enforces robustness of the BDNN during training.
arXiv Detail & Related papers (2021-10-21T18:02:58Z)
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and hardware deployment on FPGA validate the great potential of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
- Dynamic DNN Decomposition for Lossless Synergistic Inference [0.9549013615433989]
Deep neural networks (DNNs) sustain high performance in today's data processing applications.
We propose D3, a dynamic DNN decomposition system for synergistic inference without precision loss.
D3 outperforms state-of-the-art counterparts by up to 3.4 times in end-to-end DNN inference time and reduces backbone network communication overhead by up to 3.68 times.
arXiv Detail & Related papers (2021-01-15T03:18:53Z)
- TASO: Time and Space Optimization for Memory-Constrained DNN Inference [5.023660118588569]
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices.
We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on an integer linear programming (ILP) formulation for selecting primitive operations to implement convolutional layers.
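To make the selection problem concrete, here is a tiny brute-force stand-in for the ILP described in the summary: choose one primitive implementation per convolutional layer to minimize total time under a workspace-memory budget. Candidate timings, memory numbers, and the additive memory model are assumptions for illustration; TASO solves the real problem ahead of time as an ILP.

```python
# Toy stand-in (not TASO's ILP): exhaustively pick the fastest feasible
# combination of per-layer convolution implementations under a memory budget.
from itertools import product

# (time_ms, workspace_MB) per candidate implementation of each layer
layers = [
    {"im2col": (4.0, 48), "winograd": (2.5, 96), "direct": (6.0, 8)},
    {"im2col": (3.0, 32), "winograd": (1.8, 64), "direct": (5.0, 6)},
    {"im2col": (2.0, 24), "direct": (3.5, 4)},
]
BUDGET_MB = 120

best = None
for choice in product(*(layer.items() for layer in layers)):
    time = sum(t for _, (t, _) in choice)
    mem = sum(m for _, (_, m) in choice)     # additive memory model, assumed
    if mem <= BUDGET_MB and (best is None or time < best[0]):
        best = (time, [name for name, _ in choice])

print(best)   # fastest implementation choice that fits the budget
```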
arXiv Detail & Related papers (2020-05-21T15:08:06Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
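A rough sketch of what pattern-based kernel pruning looks like in practice: every 3x3 kernel keeps only the positions of one pattern from a small library, chosen to preserve the most weight magnitude, so the compiler can emit dense code per pattern. The four patterns below are made up for illustration, not PatDNN's actual pattern set.

```python
# Illustrative sketch (not PatDNN's patterns): prune each 3x3 kernel to the
# library pattern that preserves the largest total weight magnitude.
import numpy as np

patterns = np.array([                    # 1 = keep this kernel position
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
], dtype=np.float32)

def prune_kernel(kernel):
    """Apply the pattern mask with the highest preserved magnitude."""
    scores = [(np.abs(kernel) * p).sum() for p in patterns]
    return kernel * patterns[int(np.argmax(scores))]

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3)).astype(np.float32)
print(prune_kernel(kernel))
```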
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.