A Study on the Intersection of GPU Utilization and CNN Inference
- URL: http://arxiv.org/abs/2212.07936v1
- Date: Thu, 15 Dec 2022 16:11:40 GMT
- Title: A Study on the Intersection of GPU Utilization and CNN Inference
- Authors: Jack Kosaian, Amar Phanishayee
- Abstract summary: We show that there is room to improve the inference-time GPU utilization of convolutional neural network (CNN) inference, and that knowledge of GPU utilization has the potential to benefit even applications that do not target utilization itself.
- Score: 8.084016058894779
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: There has been significant progress in developing neural network
architectures that both achieve high predictive performance and that also
achieve high application-level inference throughput (e.g., frames per second).
Another metric of increasing importance is GPU utilization during inference:
the measurement of how well a deployed neural network uses the computational
capabilities of the GPU on which it runs. Achieving high GPU utilization is
critical to increasing application-level throughput and ensuring a good return
on investment for deploying GPUs.
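As a concrete reference point, GPU utilization during inference is commonly sampled from NVML counters while the network runs. The following is a minimal measurement sketch, assuming PyTorch, torchvision, and the pynvml bindings are available; note that NVML's counter reports the fraction of time any kernel was active, which is only a coarse proxy for how fully the GPU's compute resources are used:

```python
import threading
import time

import pynvml
import torch
import torchvision.models as models

def sample_utilization(handle, samples, stop_event, interval=0.01):
    # Poll NVML's utilization counter while inference runs on another thread.
    while not stop_event.is_set():
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0; adjust as needed

model = models.resnet50().cuda().eval()         # any CNN under study
x = torch.randn(8, 3, 224, 224, device="cuda")  # example batch

samples, stop = [], threading.Event()
poller = threading.Thread(target=sample_utilization, args=(handle, samples, stop))
poller.start()

with torch.no_grad():
    for _ in range(100):
        model(x)
torch.cuda.synchronize()  # ensure all kernels finish before polling stops

stop.set()
poller.join()
pynvml.nvmlShutdown()
print(f"mean GPU utilization: {sum(samples) / max(len(samples), 1):.1f}%")
```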
This paper analyzes the GPU utilization of convolutional neural network (CNN)
inference. We first survey the GPU utilization of CNNs to show that there is
room to improve the GPU utilization of many of these CNNs. We then investigate
the GPU utilization of networks within a neural architecture search (NAS)
search space, and explore how using GPU utilization as a metric could
potentially be used to accelerate NAS itself. Our study makes the case that
there is room to improve the inference-time GPU utilization of CNNs and that
knowledge of GPU utilization has the potential to benefit even applications
that do not target utilization itself. We hope that the results of this study
will spur future innovation in designing GPU-efficient neural networks.
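The abstract does not prescribe a specific NAS algorithm, but the idea of using utilization as a cheap signal can be sketched as a filter inside a random search. Everything below (sample_architecture, measure_gpu_utilization, train_and_evaluate) is a hypothetical stand-in, not the paper's method:

```python
import random

# Hypothetical stand-ins, not from the paper: a real system would plug in its
# own search space, utilization probe (e.g., the NVML sketch above), and trainer.
def sample_architecture(search_space):
    return random.choice(search_space)

def measure_gpu_utilization(arch):
    # Placeholder: run a few inference batches and read NVML counters.
    return random.uniform(0.0, 100.0)

def train_and_evaluate(arch):
    # Placeholder for full training plus validation accuracy.
    return random.uniform(0.0, 1.0)

def utilization_filtered_search(search_space, budget, util_threshold=50.0):
    """Random search that skips full training for low-utilization candidates."""
    best_arch, best_acc = None, 0.0
    for _ in range(budget):
        arch = sample_architecture(search_space)
        # Cheap pre-check: skip candidates that use the GPU poorly before
        # paying for expensive training.
        if measure_gpu_utilization(arch) < util_threshold:
            continue
        acc = train_and_evaluate(arch)
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc

best, acc = utilization_filtered_search(["net-a", "net-b", "net-c"], budget=20)
```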
Related papers
- Benchmarking GPUs on SVBRDF Extractor Model [0.0]
In this work, we compare the performance of different GPUs on neural network models that operate on larger input images (256x256).
arXiv Detail & Related papers (2023-10-19T17:09:06Z) - Transferability of Convolutional Neural Networks in Stationary Learning Tasks [96.00428692404354]
We introduce a novel framework for efficient training of convolutional neural networks (CNNs) for large-scale spatial problems.
We show that a CNN trained on small windows of such signals achieves nearly the same performance on much larger windows without retraining.
Our results show that the CNN is able to tackle problems with many hundreds of agents after being trained with fewer than ten agents.
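This transfer from small to large windows is plausible because convolutional layers are agnostic to spatial input size. A toy illustration (our own, not the paper's model):

```python
import torch
import torch.nn as nn

# A fully convolutional network has no fixed-size layers, so the same weights
# apply to any spatial input size.
fcn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),  # per-location prediction head
)

small = torch.randn(1, 1, 32, 32)    # training-sized window
large = torch.randn(1, 1, 256, 256)  # much larger deployment window
print(fcn(small).shape)  # torch.Size([1, 1, 32, 32])
print(fcn(large).shape)  # torch.Size([1, 1, 256, 256])
```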
arXiv Detail & Related papers (2023-07-21T13:51:45Z) - Quiver: Supporting GPUs for Low-Latency, High-Throughput GNN Serving with Workload Awareness [4.8412870364335925]
Quiver is a distributed GPU-based GNN serving system that achieves low latency and high throughput.
We show that Quiver achieves up to 35 times lower latency with an 8 times higher throughput compared to state-of-the-art GNN approaches.
arXiv Detail & Related papers (2023-05-18T10:34:23Z) - Architectural Implications of Embedding Dimension during GCN on CPU and GPU [6.650945912906685]
Graph Convolutional Networks (GCNs) are a widely used type of GNN for transductive graph learning problems.
GCN is challenging from an architectural perspective due to its inherent sparsity, low data reuse, and massive memory capacity requirements.
arXiv Detail & Related papers (2022-12-01T19:23:12Z) - Survey on Large Scale Neural Network Training [48.424512364338746]
Modern Deep Neural Networks (DNNs) require significant memory to store weights, activations, and other intermediate tensors during training.
This survey provides a systematic overview of the approaches that enable more efficient DNN training.
arXiv Detail & Related papers (2022-02-21T18:48:02Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
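Neighborhood-sampled mini-batch training of this kind is available off the shelf; below is a minimal single-GPU sketch using PyTorch Geometric's NeighborLoader (an illustration only, not the paper's performance-engineered sampler or its multi-GPU pipeline):

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

# Small demo graph; the paper targets much larger graphs on multiple GPUs.
data = Planetoid(root="/tmp/Cora", name="Cora")[0]

loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],       # sample 25 first-hop, 10 second-hop neighbors
    batch_size=128,               # seed nodes per mini-batch
    input_nodes=data.train_mask,  # sample batches around training nodes
    shuffle=True,
)

for batch in loader:
    # batch is a sampled subgraph; its first batch.batch_size nodes are seeds.
    out = batch.x[:batch.batch_size]  # placeholder for a GNN forward pass
    break
```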
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - L2PF -- Learning to Prune Faster [57.32153461504626]
We present a multi-task, try-and-learn method that discretely learns which CNN filters are redundant and a continuous action determining how long each layer should be fine-tuned.
For ResNet20, we have achieved a compression ratio of 3.84x with minimal accuracy degradation.
Compared to the state-of-the-art pruning method, we reduced the GPU hours by 1.71x.
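For reference, structured filter pruning itself can be expressed with PyTorch's built-in utilities; the magnitude-based criterion below is a simple stand-in, not L2PF's learned policy:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3)

# Remove 50% of output filters (dim=0) by smallest L2 norm. L2PF instead
# *learns* which filters are redundant and how long to fine-tune each layer.
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)
prune.remove(conv, "weight")  # bake the pruning mask into the weights
```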
arXiv Detail & Related papers (2021-01-07T18:13:37Z) - At-Scale Sparse Deep Neural Network Inference with Efficient GPU Implementation [24.824295164938604]
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020.
Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks.
This work presents optimized sparse matrix multiplication kernels fused with the ReLU function.
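Functionally, each such layer computes a sparse matrix product followed by ReLU; the sketch below shows the unfused PyTorch equivalent (the paper's contribution is fusing these steps into a single optimized GPU kernel, which this is not):

```python
import torch

# Functional equivalent of one SpDNN layer: a sparse weight matrix multiplied
# by dense activations, followed by ReLU. Here the two steps are separate
# PyTorch calls shown only for clarity.
W = torch.randn(1024, 1024).relu()  # ~50% zeros: a crude sparse weight matrix
W_sparse = W.to_sparse()            # convert to COO sparse format
x = torch.randn(1024, 64)           # dense activations

y = torch.sparse.mm(W_sparse, x).clamp(min=0)  # sparse matmul, then ReLU
```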
arXiv Detail & Related papers (2020-07-28T12:09:43Z) - Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z) - Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLPs) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z) - Neural Architecture Design for GPU-Efficient Networks [27.07089149328155]
We propose a general principle for designing GPU-efficient networks based on extensive empirical studies.
Based on the proposed framework, we design a family of GPU-Efficient Networks, or GENets in short.
While achieving $\geq 81.3\%$ top-1 accuracy on ImageNet, GENet is up to $6.4$ times faster than EfficientNet on GPU.
arXiv Detail & Related papers (2020-06-24T22:42:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.