An efficient and flexible inference system for serving heterogeneous
ensembles of deep neural networks
- URL: http://arxiv.org/abs/2208.14049v1
- Date: Tue, 30 Aug 2022 08:05:43 GMT
- Authors: Pierrick Pochelu, Serge G. Petiton, Bruno Conche
- Abstract summary: Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions but are compute- and memory-intensive.
We propose a new software layer to serve ensembles of DNNs with flexibility and efficiency.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensembles of Deep Neural Networks (DNNs) achieve high-quality
predictions but are compute- and memory-intensive. The demand is therefore
growing to make them answer a heavy workload of requests within the available
computational resources. Unlike recent initiatives on inference servers and
inference frameworks, which focus on serving single DNNs, we propose a new
software layer to serve ensembles of DNNs with flexibility and efficiency.
Our inference system is designed around several technical innovations. First,
we propose a novel procedure to find a good allocation matrix between devices
(CPUs or GPUs) and DNN instances. It successively runs a worst-fit algorithm
to allocate DNNs into device memory and a greedy algorithm to optimize the
allocation settings and speed up the ensemble. Second, we design the inference
system around multiple asynchronous processes: batching, prediction, and the
combination rule, tied together by an efficient internal communication scheme
that avoids overhead.
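The two-phase allocation can be illustrated with a minimal sketch: a worst-fit phase places each DNN instance on the device with the most free memory, then a greedy phase fills the leftover memory with extra replicas. All model sizes, device capacities, and throughput figures below are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of the two-phase allocation: worst-fit placement of
# DNN instances into device memory, then a greedy pass that fills the
# leftover memory with replicas of the highest-throughput model.
# All sizes and throughputs are illustrative, not from the paper.

def worst_fit_place(models, devices):
    """Place each model (largest first) on the device with most free memory."""
    free = dict(devices)                       # device -> free memory (GB)
    placement = {}
    for name, mem in sorted(models.items(), key=lambda kv: -kv[1]):
        dev = max(free, key=free.get)          # worst fit: largest remainder
        if free[dev] < mem:
            raise MemoryError(f"no device can hold {name}")
        free[dev] -= mem
        placement[name] = dev
    return placement, free

def greedy_replicas(free, models, tput):
    """Greedily add replicas of the highest-throughput model that still fits."""
    replicas = []
    while True:
        fits = [(tput[m], m, d) for m in models for d in free
                if free[d] >= models[m]]
        if not fits:
            return replicas
        _, m, d = max(fits)                    # best throughput gain first
        free[d] -= models[m]
        replicas.append((m, d))

models = {"resnet": 4.0, "bert": 8.0, "vit": 6.0}   # model -> memory (GB)
devices = {"gpu0": 16.0, "gpu1": 16.0}              # device -> capacity (GB)
placement, free = worst_fit_place(models, devices)
extra = greedy_replicas(free, models, {"resnet": 3.0, "bert": 1.0, "vit": 2.0})
```

A real system would score candidate allocations by measured ensemble throughput rather than a static per-model table.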
Experiments show flexibility and efficiency under extreme scenarios: the
system succeeds in serving an ensemble of 12 heavy DNNs on 4 GPUs and, at the
other extreme, a single DNN multi-threaded across 16 GPUs. It also outperforms
the simple baseline of optimizing the batch size of each DNN, with a speedup of
up to 2.7X on the image classification task.
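As a rough illustration of the asynchronous pipeline described in the abstract (batching, per-member prediction, and a combination rule connected by queues), here is a toy sketch. It uses Python threads and stand-in lambda "models" purely for brevity; the paper's system uses separate processes and real DNNs.

```python
# Toy sketch of an asynchronous ensemble-serving pipeline: a batching
# stage feeds every ensemble member through queues, predictors run in
# parallel, and a combiner averages member outputs. Threads stand in
# for the paper's processes; the "models" are placeholder functions.
import queue
import threading

def batcher(requests, out_queues, batch_size=2):
    batch = []
    for r in requests:                    # a leftover partial batch is
        batch.append(r)                   # ignored here for brevity
        if len(batch) == batch_size:
            for q in out_queues:
                q.put(list(batch))
            batch.clear()
    for q in out_queues:
        q.put(None)                       # end-of-stream marker

def predictor(model, in_q, out_q):
    while (batch := in_q.get()) is not None:
        out_q.put([model(x) for x in batch])
    out_q.put(None)

def combiner(in_qs, results):
    while True:
        preds = [q.get() for q in in_qs]
        if any(p is None for p in preds):
            break
        # combination rule: average the member predictions per item
        results.extend(sum(xs) / len(xs) for xs in zip(*preds))

models = [lambda x: x + 1, lambda x: x + 3]         # stand-in DNNs
in_qs = [queue.Queue() for _ in models]
out_qs = [queue.Queue() for _ in models]
results = []
threads = [threading.Thread(target=batcher, args=(range(4), in_qs))]
threads += [threading.Thread(target=predictor, args=(m, i, o))
            for m, i, o in zip(models, in_qs, out_qs)]
threads += [threading.Thread(target=combiner, args=(out_qs, results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every stage blocks only on its own queue, batching, prediction, and combination can overlap in time, which is the point of the multi-process design.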
Related papers
- DiviML: A Module-based Heuristic for Mapping Neural Networks onto
Heterogeneous Platforms [5.970091958678456]
We develop an approach for compiler-level partitioning of deep neural networks (DNNs) onto multiple interconnected hardware devices.
Our scheduler integrates both an exact solver, through a mixed integer linear programming (MILP) formulation, and a modularity-based runtime.
We show how we can extend our framework to schedule large language models across multiple heterogeneous servers.
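On a toy instance, the kind of module-to-device assignment an exact solver produces can be mimicked by exhaustive search. This sketch is only loosely inspired by the summary above, not DiviML's actual MILP formulation; the module costs and device speeds are invented.

```python
# Toy illustration (not DiviML itself): exhaustively assign network
# modules to heterogeneous devices so the most-loaded device finishes
# as early as possible -- the optimum a MILP solver would also find
# on a small instance. Costs and device speeds are invented.
from itertools import product

def exact_map(module_flops, device_speed):
    """Enumerate every assignment; return the one minimizing the makespan."""
    n_dev = len(device_speed)
    best, best_makespan = None, float("inf")
    for assign in product(range(n_dev), repeat=len(module_flops)):
        loads = [0.0] * n_dev
        for flops, dev in zip(module_flops, assign):
            loads[dev] += flops / device_speed[dev]
        if max(loads) < best_makespan:
            best, best_makespan = assign, max(loads)
    return best, best_makespan

# Three modules on a fast GPU (speed 2) and a slow CPU (speed 1).
assign, makespan = exact_map([4.0, 2.0, 2.0], [2.0, 1.0])
```

Exhaustive search scales exponentially, which is why the paper pairs the exact solver with a modularity-based heuristic for larger graphs.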
arXiv Detail & Related papers (2023-07-31T19:46:49Z)
- SENSEi: Input-Sensitive Compilation for Accelerating GNNs [7.527596018706567]
We propose SENSEi, a system that exposes different sparse and dense matrix primitive compositions based on different matrix re-associations of GNN computations.
SENSEi executes in two stages: (1) an offline compilation stage that enumerates all valid re-associations leading to different sparse-dense matrix compositions and uses input-oblivious pruning techniques to prune away clearly unprofitable candidates.
On a wide range of configurations, SENSEi achieves speedups of up to $2.012\times$ and $1.85\times$ on graph convolutional networks, and up to $6.294\times$ and $16.274\times$.
arXiv Detail & Related papers (2023-06-27T02:24:05Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
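A minimal sketch of the idea: estimate end-to-end latency for every candidate split point under the current data rate and server load, and pick the cheapest. The layer costs and activation sizes below are invented numbers, not taken from the paper.

```python
# Hypothetical sketch of dynamic split computing: choose the layer at
# which to split a DNN between edge device and server by minimizing
# estimated device time + transfer time + server time for the current
# channel rate and server load. All numbers are illustrative.

def best_split(dev_ms, srv_ms, act_mbits, rate_mbps, server_load):
    """Return (split index, cost in ms); layers [0, split) run on-device."""
    best, best_cost = 0, float("inf")
    for split in range(len(dev_ms) + 1):
        device = sum(dev_ms[:split])
        server = sum(srv_ms[split:]) * server_load
        # transmit the activation at the split (index 0 = raw input)
        transfer = 1000.0 * act_mbits[split] / rate_mbps
        cost = device + server + transfer
        if cost < best_cost:
            best, best_cost = split, cost
    return best, best_cost

# Slow link: keep computation on-device to avoid shipping activations.
split_slow, _ = best_split([5, 5, 5], [1, 1, 1], [8, 4, 2, 0.5], 10, 2)
# Fast link: offloading most layers becomes worthwhile.
split_fast, _ = best_split([5, 5, 5], [1, 1, 1], [8, 4, 2, 0.5], 1000, 2)
```

Re-evaluating this choice as the channel rate and server load change over time is what makes the split "dynamic".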
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- Efficient and Robust Mixed-Integer Optimization Methods for Training Binarized Deep Neural Networks [0.07614628596146598]
We study deep neural networks with binary activation functions and continuous or integer weights (BDNN).
We show that the BDNN can be reformulated as a mixed-integer linear program with bounded weight space which can be solved to global optimality by classical mixed-integer programming solvers.
For the first time a robust model is presented which enforces robustness of the BDNN during training.
arXiv Detail & Related papers (2021-10-21T18:02:58Z)
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and hardware deployment on an FPGA validate the great potential of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
- Dynamic DNN Decomposition for Lossless Synergistic Inference [0.9549013615433989]
Deep neural networks (DNNs) sustain high performance in today's data processing applications.
We propose D3, a dynamic DNN decomposition system for synergistic inference without precision loss.
D3 outperforms state-of-the-art counterparts by up to 3.4 times in end-to-end DNN inference time and reduces backbone network communication overhead by up to 3.68 times.
arXiv Detail & Related papers (2021-01-15T03:18:53Z)
- Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
arXiv Detail & Related papers (2020-12-31T18:48:58Z)
- TASO: Time and Space Optimization for Memory-Constrained DNN Inference [5.023660118588569]
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices.
We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on integer linear programming (ILP) for selecting the primitive operations that implement convolutional layers.
arXiv Detail & Related papers (2020-05-21T15:08:06Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.