Related papers: Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications

Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications

URL: http://arxiv.org/abs/2503.13343v1
Date: Mon, 17 Mar 2025 16:21:48 GMT
Title: Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications
Authors: Andre Merzky, Mikhail Titov, Matteo Turilli, Ozgur Kilic, Tianle Wang, Shantenu Jha,
Abstract summary: Hybrid combining traditional HPC and novel ML methodologies are transforming scientific computing.<n>This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out- HPC.<n>Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads.
Score: 2.0999841017238063
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.

Related papers

StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs) StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent.<n>From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content.<n>From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
Exascale Workflow Applications and Middleware: An ExaWorks Retrospective [3.4423220997316593]
We present the ExaWorks project, which addresses the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We developed a workflow Software Development Toolkit (SDK), a job management abstraction API, and PSI/J, a minimal interface for submitting and monitoring jobs. We discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of sustainably at the exascale.
arXiv Detail & Related papers (2024-11-16T00:10:53Z)
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work attempts to study using LLM-based agents to design collaborative AI systems autonomously.<n>Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating.<n>While ComfyAgent achieves a comparable resolve rate to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale [1.474723404975345]
Hydra is an intra cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial private cloud and HPC platforms. We present Hydra an intra cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial private cloud and HPC platforms.
arXiv Detail & Related papers (2024-07-16T17:59:46Z)
HPC-GPT: Integrating Large Language Model for High-Performance Computing [3.8078849170829407]
We propose HPC-GPT, a novel LLaMA-based model that has been supervised fine-tuning using generated QA (Question-Answer) instances for the HPC domain. To evaluate its effectiveness, we concentrate on two HPC tasks: managing AI models and datasets for HPC, and data race detection. Our experiments on open-source benchmarks yield extensive results, underscoring HPC-GPT's potential to bridge the performance gap between LLMs and HPC-specific tasks.
arXiv Detail & Related papers (2023-10-03T01:34:55Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
AI-coupled HPC Workflows [1.5469452301122175]
Introduction to AI/ML models into the traditional HPC has been an enabler of highly accurate modeling. Various modes of integrating AI/ML models to HPC computations, resulting in diverse types of AI-coupled HPC.
arXiv Detail & Related papers (2022-08-24T19:16:43Z)
Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing. For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data. This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning [61.28547338576706]
Population-based multi-agent reinforcement learning (PB-MARL) refers to the series of methods nested with reinforcement learning (RL) algorithms. We present MALib, a scalable and efficient computing framework for PB-MARL.
arXiv Detail & Related papers (2021-06-05T03:27:08Z)
Integrating Deep Learning in Domain Sciences at Exascale [2.241545093375334]
We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently. We propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems. We present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications with AI.
arXiv Detail & Related papers (2020-11-23T03:09:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.