Related papers: AI-coupled HPC Workflows

AI-coupled HPC Workflows

URL: http://arxiv.org/abs/2208.11745v1
Date: Wed, 24 Aug 2022 19:16:43 GMT
Title: AI-coupled HPC Workflows
Authors: Shantenu Jha, Vincent R. Pascuzzi, Matteo Turilli
Abstract summary: Introduction to AI/ML models into the traditional HPC has been an enabler of highly accurate modeling. Various modes of integrating AI/ML models to HPC computations, resulting in diverse types of AI-coupled HPC.
Score: 1.5469452301122175
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Increasingly, scientific discovery requires sophisticated and scalable workflows. Workflows have become the ``new applications,'' wherein multi-scale computing campaigns comprise multiple and heterogeneous executable tasks. In particular, the introduction of AI/ML models into the traditional HPC workflows has been an enabler of highly accurate modeling, typically reducing computational needs compared to traditional methods. This chapter discusses various modes of integrating AI/ML models to HPC computations, resulting in diverse types of AI-coupled HPC workflows. The increasing need of coupling AI/ML and HPC across scientific domains is motivated, and then exemplified by a number of production-grade use cases for each mode. We additionally discuss the primary challenges of extreme-scale AI-coupled HPC campaigns -- task heterogeneity, adaptivity, performance -- and several framework and middleware solutions which aim to address them. While both HPC workflow and AI/ML computing paradigms are independently effective, we highlight how their integration, and ultimate convergence, is leading to significant improvements in scientific performance across a range of domains, ultimately resulting in scientific explorations otherwise unattainable.

Related papers

Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration [81.45763823762682]
This work aims to bridge the gap by investigating the problem of data synthesis through multi-agent sampling. We introduce Tree Search-based Orchestrated Agents(TOA), where the workflow evolves iteratively during the sequential sampling process. Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi-agent sampling significantly outperforms single-agent sampling as inference compute scales.
arXiv Detail & Related papers (2024-12-22T15:16:44Z)
GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use to enhance performance to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex, offering greater flexibility and scalability compared to monolithic models. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
Employing Artificial Intelligence to Steer Exascale Workflows with Colmena [37.42013214123005]
Colmena allows scientists to define how their application should respond to events as a series of cooperative agents. We describe the challenges we overcame while deploying applications on exascale systems, and the science we have enhanced through AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.
arXiv Detail & Related papers (2024-08-26T17:21:19Z)
Using AI libraries for Incompressible Computational Fluid Dynamics [0.7734726150561089]
We present a novel methodology to bring the power of both AI software and hardware into the field of numerical modelling. We use the proposed methodology to solve the advection-diffusion equation, the non-linear Burgers equation and incompressible flow past a bluff body.
arXiv Detail & Related papers (2024-02-27T22:00:50Z)
Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem. We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
HPC-GPT: Integrating Large Language Model for High-Performance Computing [3.8078849170829407]
We propose HPC-GPT, a novel LLaMA-based model that has been supervised fine-tuning using generated QA (Question-Answer) instances for the HPC domain. To evaluate its effectiveness, we concentrate on two HPC tasks: managing AI models and datasets for HPC, and data race detection. Our experiments on open-source benchmarks yield extensive results, underscoring HPC-GPT's potential to bridge the performance gap between LLMs and HPC-specific tasks.
arXiv Detail & Related papers (2023-10-03T01:34:55Z)
Meta-Learning for Airflow Simulations with Graph Neural Networks [3.52359746858894]
We present a meta-learning approach to enhance the performance of learned models on out-of-distribution (OoD) samples. Specifically, we set the airflow simulation in CFD over various airfoils as a meta-learning problem, where each set of examples defined on a single airfoil shape is treated as a separate task. We experimentally demonstrate the efficiency of the proposed approach for improving the OoD generalization performance of learned models.
arXiv Detail & Related papers (2023-06-18T19:25:13Z)
Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices. We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
Multi-fidelity Hierarchical Neural Processes [79.0284780825048]
Multi-fidelity surrogate modeling reduces the computational cost by fusing different simulation outputs. We propose Multi-fidelity Hierarchical Neural Processes (MF-HNP), a unified neural latent variable model for multi-fidelity surrogate modeling. We evaluate MF-HNP on epidemiology and climate modeling tasks, achieving competitive performance in terms of accuracy and uncertainty estimation.
arXiv Detail & Related papers (2022-06-10T04:54:13Z)
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models. Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems. Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC. We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
Integrating Deep Learning in Domain Sciences at Exascale [2.241545093375334]
We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently. We propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems. We present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications with AI.
arXiv Detail & Related papers (2020-11-23T03:09:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.