Accelerating Multi-Model Inference by Merging DNNs of Different Weights
- URL: http://arxiv.org/abs/2009.13062v1
- Date: Mon, 28 Sep 2020 04:33:09 GMT
- Title: Accelerating Multi-Model Inference by Merging DNNs of Different Weights
- Authors: Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, Yunseong Lee, Byung-Gon
Chun
- Abstract summary: We propose NetFuse, a technique of merging multiple DNN models that share the same architecture but have different weights and different inputs.
Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference time by up to 3.6x on an NVIDIA V100 GPU.
- Score: 3.4123736336071864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standardized DNN models that have been proven to perform well on machine
learning tasks are widely used and often adopted as-is to solve downstream
tasks, forming the transfer learning paradigm. However, when serving multiple
instances of such DNN models from a cluster of GPU servers, existing techniques
to improve GPU utilization such as batching are inapplicable because models
often do not share weights due to fine-tuning. We propose NetFuse, a technique
of merging multiple DNN models that share the same architecture but have
different weights and different inputs. NetFuse is made possible by replacing
operations with more general counterparts that allow a set of weights to be
associated with only a certain set of inputs. Experiments on ResNet-50,
ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference time
up to 3.6x on an NVIDIA V100 GPU and up to 3.0x on a TITAN Xp GPU when merging
32 model instances, while using only a small amount of additional GPU memory.
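To make the merging idea concrete, here is a minimal PyTorch sketch of the batched execution described in the abstract. It is an illustration under assumed shapes, not the authors' NetFuse implementation: N linear layers that share an architecture but not weights collapse into one batched matmul, so each weight set stays paired with only its own inputs while the GPU runs a single large kernel.

```python
import torch

N, B, D_in, D_out = 32, 8, 512, 512   # model instances, per-model batch, dims

# Stand-ins for 32 fine-tuned weight sets that share one architecture.
weights = torch.randn(N, D_in, D_out)
biases = torch.randn(N, 1, D_out)

# Each model instance receives its own inputs.
inputs = torch.randn(N, B, D_in)

# Unmerged baseline: one small matmul per model instance.
separate = torch.stack([inputs[i] @ weights[i] + biases[i] for i in range(N)])

# Merged counterpart: a single batched matmul covers all 32 instances,
# keeping each weight set associated only with its own inputs.
merged = torch.baddbmm(biases, inputs, weights)

assert torch.allclose(separate, merged, atol=1e-3)
```

The same pattern plausibly extends to convolutions via grouped counterparts, in line with the abstract's "replacing operations with more general counterparts" framing.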
Related papers
- MatchNAS: Optimizing Edge AI in Sparse-Label Data Contexts via Automating Deep Neural Network Porting for Mobile Deployment [54.77943671991863]
MatchNAS is a novel scheme for porting Deep Neural Networks to mobile devices.
We optimise a large network family using both labelled and unlabelled data.
We then automatically search for tailored networks for different hardware platforms.
arXiv Detail & Related papers (2024-02-21T04:43:12Z)
- DNNShifter: An Efficient DNN Pruning System for Edge Computing [1.853502789996996]
Deep neural networks (DNNs) underpin many machine learning applications.
Production-quality DNN models achieve high inference accuracy by training millions of DNN parameters, which carries a significant resource footprint.
This presents a challenge for resources operating at the extreme edge of the network, such as mobile and embedded devices that have limited computational and memory resources.
Existing pruning methods either cannot match the quality of their unpruned counterparts without significant time costs and overheads, or are limited to offline use cases.
Our work rapidly derives suitable model variants while maintaining the accuracy of the original model. The model variants can be swapped quickly when system conditions change.
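As a rough sketch of how lighter model variants can be derived, the following applies generic structured magnitude pruning to a single convolution layer. This is a common baseline, not DNNShifter's actual method; the layer sizes and keep ratio are arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Conv2d(32, 64, kernel_size=3, padding=1)

# Rank filters by L1 norm and keep the strongest half.
norms = layer.weight.detach().abs().sum(dim=(1, 2, 3))
keep = norms.argsort(descending=True)[:32]

# Build the smaller variant by copying only the kept filters.
pruned = nn.Conv2d(32, 32, kernel_size=3, padding=1)
pruned.weight.data = layer.weight.data[keep].clone()
pruned.bias.data = layer.bias.data[keep].clone()

x = torch.randn(1, 32, 16, 16)
print(layer(x).shape, pruned(x).shape)   # (1, 64, 16, 16) vs (1, 32, 16, 16)
```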
arXiv Detail & Related papers (2023-09-13T14:05:50Z)
- Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers [13.620650014358413]
Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers with access to only limited resources is that GPU memory capacity is small compared to model size.
arXiv Detail & Related papers (2022-02-02T22:16:27Z)
- Network Augmentation for Tiny Deep Learning [73.57192520534585]
We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks.
We demonstrate the effectiveness of NetAug on image classification and object detection.
arXiv Detail & Related papers (2021-10-17T18:48:41Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic slice-able network (DS-Net++), which input-dependently adjust the number of filters in CNNs and multiple dimensions in both CNNs and transformers.
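A minimal sketch of the slicing mechanics as read from this summary (not the DS-Net++ code): one shared weight tensor is sliced along the filter dimension per input, with the slice ratio standing in for the output of a learned, input-dependent gate.

```python
import torch
import torch.nn.functional as F

weight = torch.randn(64, 32, 3, 3)    # full conv layer: 64 filters
bias = torch.randn(64)

def sliced_conv(x, ratio):
    # Run only the leading fraction of filters of the shared weight.
    # In DS-Net the ratio would come from a learned gate, not a constant.
    k = max(1, int(64 * ratio))
    return F.conv2d(x, weight[:k], bias[:k], padding=1)

x = torch.randn(1, 32, 8, 8)
y_easy = sliced_conv(x, ratio=0.25)   # "easy" input: run 16 filters
y_hard = sliced_conv(x, ratio=1.0)    # "hard" input: run all 64
print(y_easy.shape, y_hard.shape)     # (1, 16, 8, 8) vs (1, 64, 8, 8)
```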
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Dynamic-OFA: Runtime DNN Architecture Switching for Performance Scaling on Heterogeneous Embedded Platforms [3.3197851873862385]
This paper proposes Dynamic-OFA, a novel dynamic DNN approach for state-of-the-art platform-aware NAS models (i.e., the Once-for-all (OFA) network).
Our experimental results using ImageNet on a Jetson Xavier NX show that the approach is up to 3.5x faster than the state of the art for similar ImageNet Top-1 accuracy.
arXiv Detail & Related papers (2021-05-08T05:10:53Z)
- When deep learning models on GPU can be accelerated by taking advantage of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of convolution layers.
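A toy version of that comparison, with the convolution lowered to matrix multiplication in im2col form; the sizes, sparsity level, and thresholding are illustrative assumptions, not the paper's benchmark setup.

```python
import torch

out_ch, k_area_in = 128, 3 * 3 * 64          # filters x (kH * kW * in_channels)
n_patches = 32 * 32                          # number of output positions

weight = torch.randn(out_ch, k_area_in)
weight[weight.abs() < 1.0] = 0.0             # roughly 68% unstructured sparsity
patches = torch.randn(k_area_in, n_patches)  # im2col'd input

dense_out = weight @ patches                 # dense kernel runs every MAC
sparse_out = torch.sparse.mm(weight.to_sparse(), patches)  # zeros are skipped

print(f"sparsity: {(weight == 0).float().mean():.2%}")
assert torch.allclose(dense_out, sparse_out, atol=1e-3)
```

Whether the sparse path actually wins depends on sparsity level and hardware, which is exactly the trade-off the paper studies.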
arXiv Detail & Related papers (2020-11-12T10:13:48Z)
- ShiftAddNet: A Hardware-Inspired Deep Network [87.18216601210763]
ShiftAddNet is an energy-efficient multiplication-less deep neural network.
It leads to both energy-efficient inference and training, without compromising expressive capacity.
ShiftAddNet aggressively reduces the hardware-quantified energy cost of DNN training and inference by over 80%, while offering comparable or better accuracy.
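One way to see the shift half of the idea: restricting weights to signed powers of two turns each multiplication into a bit shift in fixed-point hardware. The sketch below shows only that weight mapping and its approximation error; ShiftAddNet itself combines dedicated shift and add layers.

```python
import torch

def to_power_of_two(w, min_exp=-8, max_exp=0):
    # Snap each weight to sign * 2^e with an integer exponent e, so every
    # product x * w becomes a bit shift of x in fixed-point arithmetic.
    sign = torch.sign(w)
    exp = torch.log2(w.abs().clamp(min=2.0 ** min_exp)).round()
    return sign * 2.0 ** exp.clamp(min_exp, max_exp)

w = 0.5 * torch.randn(16, 16)
w_shift = to_power_of_two(w)

x = torch.randn(4, 16)
y_exact = x @ w          # ordinary multiply-accumulate
y_approx = x @ w_shift   # every product is now a shifted copy of x
print((y_exact - y_approx).abs().mean())   # approximation error
```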
arXiv Detail & Related papers (2020-10-24T05:09:14Z)
- Spatial Sharing of GPU for Autotuning DNN models [4.63732827131233]
Deep neural networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs.
We present several techniques to maximize resource utilization and improve tuning performance.
arXiv Detail & Related papers (2020-08-08T21:27:38Z)
- Neural Network Compression Framework for fast model inference [59.65531492759006]
We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances in network compression and implements several methods, such as sparsity, quantization, and binarization.
The framework can be used within the sample training pipelines supplied with it, or as a standalone package that integrates seamlessly into existing training code.
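Since the blurb names quantization among the implemented methods, here is a generic fake-quantization step with a straight-through estimator, the kind of transformation such compression frameworks insert during fine-tuning. It sketches the general technique, not NNCF's actual API.

```python
import torch

def fake_quantize(x, num_bits=8):
    # Symmetric per-tensor fake quantization: round to the integer grid,
    # map back to float, and pass gradients straight through (STE).
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = (x / scale).round().clamp(-qmax, qmax) * scale
    return x + (q - x).detach()   # forward: q, backward: identity

w = torch.randn(64, 64, requires_grad=True)
loss = (fake_quantize(w) ** 2).sum()
loss.backward()                   # gradients reach w through the STE
print(w.grad.abs().mean())
```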
arXiv Detail & Related papers (2020-02-20T11:24:01Z)