Hierarchical Roofline Performance Analysis for Deep Learning Applications
- URL: http://arxiv.org/abs/2009.05257v4
- Date: Wed, 25 Nov 2020 02:52:41 GMT
- Title: Hierarchical Roofline Performance Analysis for Deep Learning Applications
- Authors: Charlene Yang, Yunsong Wang, Steven Farrell, Thorsten Kurth, Samuel Williams
- Abstract summary: This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs.
It discusses the extension of the Empirical Roofline Toolkit for broader support of a range of data precisions and Tensor Core support, and introduces an Nsight Compute-based method to accurately collect application performance information.
- Score: 0.06999740786886534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a practical methodology for collecting performance data
necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs. It
discusses the extension of the Empirical Roofline Toolkit for broader support
of a range of data precisions and Tensor Core support, and introduces an Nsight
Compute-based method to accurately collect application performance information.
This methodology allows for automated machine characterization and application
characterization for Roofline analysis across the entire memory hierarchy on
NVIDIA GPUs, and it is validated by a complex deep learning application used
for climate image segmentation. We use two versions of the code, in TensorFlow
and PyTorch respectively, to demonstrate the use and effectiveness of this
methodology. We highlight how the application utilizes the compute and memory
capabilities on the GPU and how the implementation and performance differ in
two deep learning frameworks.
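The Roofline model underlying this analysis bounds a kernel's attainable performance by the minimum of the machine's compute peak and the product of arithmetic intensity and memory bandwidth, evaluated per level of the memory hierarchy. A minimal sketch, with placeholder ceilings (the peak and bandwidth numbers below are illustrative assumptions, not values measured with the paper's toolkit):

```python
# Hierarchical Roofline bound: performance is capped by
# min(compute peak, arithmetic intensity * bandwidth) at each memory level.

def attainable_gflops(ai, peak_gflops, bw_gbs):
    """Roofline bound for a kernel with arithmetic intensity `ai` (FLOPs/byte)."""
    return min(peak_gflops, ai * bw_gbs)

# Hypothetical bandwidth ceilings (GB/s) per memory level, and an FP32 peak
ceilings = {"L1": 14000.0, "L2": 3000.0, "HBM": 900.0}
peak_fp32 = 15700.0  # GFLOP/s, illustrative

ai = 10.0  # FLOPs per byte moved at each level (example kernel)
for level, bw in ceilings.items():
    bound = attainable_gflops(ai, peak_fp32, bw)
    print(f"{level}: bound {bound:.0f} GFLOP/s")
```

At these example numbers the kernel is compute-bound against the cache ceilings but memory-bound against HBM, which is exactly the distinction a hierarchical (rather than single-level) Roofline makes visible.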
Related papers
- DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads [5.987963635879264]
This paper introduces DeepContext, a novel profiler that links program contexts across high-level Python code, deep learning frameworks, underlying libraries written in C/C++, as well as device code executed on GPUs.
DeepContext incorporates measurements of both coarse- and fine-grained performance metrics for major deep learning frameworks, such as PyTorch and JAX.
In addition, DeepContext integrates a novel GUI that allows users to quickly identify hotspots, and an innovative automated performance analyzer that suggests potential optimizations to users based on performance metrics and program context.
arXiv Detail & Related papers (2024-11-05T04:15:26Z)
- Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies [0.0]
The study introduces an analytical model-driven tuning methodology and a Machine Learning (ML)-based tuning methodology.
We evaluate the performance of the two tuning methodologies for different parallel prefix implementations of the BPLG library in an NVIDIA Jetson system.
arXiv Detail & Related papers (2023-10-24T22:09:03Z)
- Benchmarking Node Outlier Detection on Graphs [90.29966986023403]
Graph outlier detection is an emerging but crucial machine learning task with numerous applications.
We present the first comprehensive unsupervised node outlier detection benchmark for graphs called UNOD.
arXiv Detail & Related papers (2022-06-21T01:46:38Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
Existing approaches, however, do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- UPDesc: Unsupervised Point Descriptor Learning for Robust Registration [54.95201961399334]
UPDesc is an unsupervised method to learn point descriptors for robust point cloud registration.
We show that our learned descriptors yield superior performance over existing unsupervised methods.
arXiv Detail & Related papers (2021-08-05T17:11:08Z)
- Densely Nested Top-Down Flows for Salient Object Detection [137.74130900326833]
This paper revisits the role of top-down modeling in salient object detection.
It designs a novel densely nested top-down flows (DNTDF)-based framework.
In every stage of DNTDF, features from higher levels are read in via progressive compression shortcut paths (PCSP).
arXiv Detail & Related papers (2021-02-18T03:14:02Z)
- Comparative Code Structure Analysis using Deep Learning for Performance Prediction [18.226950022938954]
This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure.
Our evaluations of several deep embedding learning methods demonstrate that tree-based Long Short-Term Memory (LSTM) models can leverage the hierarchical structure of source code to discover latent representations and achieve up to 84% (individual problem) and 73% (combined dataset with multiple problems) accuracy in predicting the change in performance.
arXiv Detail & Related papers (2021-02-12T16:59:12Z)
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose a novel holistically-guided decoder to obtain high-resolution, semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
- Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning [73.75457731689858]
We develop a computation-efficient yet accurate network based on the proposed attentive auxiliary features (A$^2$F) for SISR.
Experimental results on large-scale datasets demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
- Time-Based Roofline for Deep Learning Performance Analysis [2.547058931949976]
This paper presents a Roofline-based approach to performance analysis that facilitates the optimization of deep learning applications.
We take two sets of representative kernels, 2D convolution and long short-term memory, to validate and demonstrate the use of this new approach.
Compared to the common ad-hoc approach, this study helps form a more systematic way to analyze code performance.
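Placing a measured kernel on a Roofline plot reduces to two ratios over profiler counters: arithmetic intensity (FLOPs per byte moved) and achieved throughput (FLOPs per second). A hedged sketch; the counter values below are invented, standing in for what a profiler such as Nsight Compute would report:

```python
# Convert measured kernel counters (FLOP count, bytes moved, run time)
# into a point on the Roofline plot.

def roofline_point(flops, bytes_moved, seconds):
    """Return (arithmetic intensity in FLOPs/byte, achieved GFLOP/s)."""
    ai = flops / bytes_moved
    gflops = flops / seconds / 1e9
    return ai, gflops

# Example: a 2D-convolution-like kernel with made-up counter values
ai, perf = roofline_point(flops=2.0e12, bytes_moved=1.0e11, seconds=0.5)
print(f"AI = {ai:.1f} FLOPs/byte, achieved = {perf:.0f} GFLOP/s")
```

Comparing the achieved point against the bandwidth and compute ceilings then tells you whether the kernel is memory- or compute-bound, which is the systematic diagnosis this approach offers over ad-hoc timing.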
arXiv Detail & Related papers (2020-09-09T23:29:04Z)
- Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration.
Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the Xeon Phi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.