MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning
on HPC Systems
- URL: http://arxiv.org/abs/2110.11466v1
- Date: Thu, 21 Oct 2021 20:30:12 GMT
- Title: MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning
on HPC Systems
- Authors: Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr
Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter
Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika
Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens
Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi
Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo,
Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça,
Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St.
John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi
Yamazaki, Junqi Yin
- Abstract summary: We introduce MLPerf HPC, a benchmark suite of scientific machine learning training applications driven by the MLCommons™ Association.
We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance.
We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific communities are increasingly adopting machine learning and deep
learning models in their applications to accelerate scientific insights. High
performance computing systems are pushing the frontiers of performance with a
rich diversity of hardware resources and massive scale-out capabilities. There
is a critical need to understand fair and effective benchmarking of machine
learning applications that are representative of real-world scientific use
cases. MLPerf™ is a community-driven standard to benchmark machine learning
workloads, focusing on end-to-end performance metrics. In this paper, we
introduce MLPerf HPC, a benchmark suite of large-scale scientific machine
learning training applications, driven by the MLCommons™ Association. We
present the results from the first submission round including a diverse set of
some of the world's largest HPC systems. We develop a systematic framework for
their joint analysis and compare them in terms of data staging, algorithmic
convergence, and compute performance. As a result, we gain a quantitative
understanding of optimizations on different subsystems such as staging and
on-node loading of data, compute-unit utilization, and communication scheduling,
enabling overall >10x (end-to-end) performance improvements through system
scaling. Notably, our analysis shows a scale-dependent interplay between the
dataset size, a system's memory hierarchy, and training convergence that
underlines the importance of near-compute storage. To overcome the
data-parallel scalability challenge at large batch sizes, we discuss specific
learning techniques and hybrid data-and-model parallelism that are effective on
large systems. We conclude by characterizing each benchmark with respect to
low-level memory, I/O, and network behavior to parameterize extended roofline
performance models in future rounds.
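
Since the concluding characterization is intended to parameterize roofline performance models, the underlying relation is worth recalling: attainable throughput is the minimum of a system's peak compute rate and its memory bandwidth multiplied by a kernel's arithmetic intensity (FLOPs per byte moved). The following Python sketch is a minimal illustration of that formula only; the peak FLOP/s, bandwidth, and intensity values are hypothetical placeholders, not measurements from the MLPerf HPC benchmarks.

    def roofline_attainable_flops(peak_flops: float,
                                  peak_bandwidth_bytes: float,
                                  arithmetic_intensity: float) -> float:
        """Classic roofline model: performance is bounded either by peak compute
        or by memory bandwidth * arithmetic intensity (FLOPs per byte)."""
        return min(peak_flops, peak_bandwidth_bytes * arithmetic_intensity)

    # Hypothetical accelerator figures (placeholders, not benchmark results).
    peak_flops = 19.5e12            # 19.5 TFLOP/s peak compute
    peak_bw = 1.6e12                # 1.6 TB/s memory bandwidth
    ridge_point = peak_flops / peak_bw  # intensity at which a kernel becomes compute-bound

    for intensity in (1.0, 4.0, ridge_point, 64.0):  # FLOPs per byte
        attainable = roofline_attainable_flops(peak_flops, peak_bw, intensity)
        bound = "compute-bound" if intensity >= ridge_point else "memory-bound"
        print(f"intensity {intensity:6.2f} FLOP/B -> {attainable / 1e12:5.2f} TFLOP/s ({bound})")

Extended roofline models of the kind mentioned above add further ceilings (e.g., for I/O or network bandwidth) on top of this basic compute/memory formulation.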
Related papers
- Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference [2.2231908139555734]
We propose a general performance modeling methodology and workload analysis of distributed LLM training and inference.
We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA).
arXiv Detail & Related papers (2024-07-19T19:49:05Z) - PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance.
Processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
arXiv Detail & Related papers (2024-04-10T17:00:04Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Machine Learning Training on a Real Processing-in-Memory System [9.286176889576996]
Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems with processing-in-memory capabilities can alleviate this data movement bottleneck.
Our work is the first one to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-06-13T10:20:23Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach, however, does not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Scalable Graph Embedding Learning On A Single GPU [18.142879223260785]
We introduce a hybrid CPU-GPU framework that addresses the challenges of learning embedding of large-scale graphs.
We show that our system can scale training to datasets with an order of magnitude greater than a single machine's total memory capacity.
arXiv Detail & Related papers (2021-10-13T19:09:33Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z) - The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z) - Model-Based Deep Learning [155.063817656602]
Signal processing, communications, and control have traditionally relied on classical statistical modeling techniques.
Deep neural networks (DNNs) use generic architectures which learn to operate from data, and demonstrate excellent performance.
We are interested in hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches.
arXiv Detail & Related papers (2020-12-15T16:29:49Z) - A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale machine learning aims to learn patterns from big data efficiently while maintaining comparable performance.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)