Related papers: Exascale Workflow Applications and Middleware: An ExaWorks Retrospective

Exascale Workflow Applications and Middleware: An ExaWorks Retrospective

URL: http://arxiv.org/abs/2411.10637v1
Date: Sat, 16 Nov 2024 00:10:53 GMT
Title: Exascale Workflow Applications and Middleware: An ExaWorks Retrospective
Authors: Aymen Alsaadi, Mihael Hategan-Marandiuc, Ketan Maheshwari, Andre Merzky, Mikhail Titov, Matteo Turilli, Andreas Wilke, Justin M. Wozniak, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney,
Abstract summary: We present the ExaWorks project, which addresses the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We developed a workflow Software Development Toolkit (SDK), a job management abstraction API, and PSI/J, a minimal interface for submitting and monitoring jobs. We discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of sustainably at the exascale.
Score: 3.4423220997316593
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. However, these software combinations and integrations are difficult to achieve due to the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API, to simplify the construction of portable software components and applications that can be used over various HPC schedulers. The PSI/J API is a minimal interface for submitting and monitoring jobs and their execution state across multiple and commonly used HPC schedulers. We also describe several leading and innovative workflow examples of ExaWorks tools used on DOE leadership platforms. Furthermore, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.

Related papers

Towards Conversational AI for Human-Machine Collaborative MLOps [0.17152709285783643]
This paper presents a Large Language Model (LLM) based conversational agent system designed to enhance human-machine collaboration in Machine Learning Operations (MLOps) We introduce the Swarm Agent, an architecture that integrates specialized agents to create and manage ML through natural language interactions. The paper describes the architecture, implementation details, and demonstrates how this conversational MLOps assistant reduces complexity and lowers to entry for users across diverse technical skill levels.
arXiv Detail & Related papers (2025-04-16T20:28:50Z)
Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications [2.0999841017238063]
Hybrid combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out- HPC. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads.
arXiv Detail & Related papers (2025-03-17T16:21:48Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities.<n>It supports more critical tasks including data analysis, annotation, and foundation model post-training.<n>It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We presentLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration. It first constructs a large-scale fine-tuningBench with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. LlamaLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z)
GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use to enhance performance to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex, offering greater flexibility and scalability compared to monolithic models. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies [3.1805622006446397]
Heterogeneous scientific discovery increasingly requires executing on high-performance computing platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK) The SDK is a collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms.
arXiv Detail & Related papers (2024-07-23T17:00:09Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
Leveraging Large Language Models to Build and Execute Computational Workflows [40.572754656757475]
This paper explores how these emerging capabilities can be harnessed to facilitate complex scientific research. We present initial findings from our attempt to integrate Phyloflow with OpenAI's function-calling API, and outline a strategy for developing a comprehensive workflow management system.
arXiv Detail & Related papers (2023-12-12T20:17:13Z)
Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT [11.410608233274942]
Scientific systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets. However, implementing is difficult due to the involvement of many blackbox tools and the deep infrastructure stack necessary for their execution. We investigate the efficiency of Large Language Models, specifically ChatGPT, to support users when dealing with scientific domains.
arXiv Detail & Related papers (2023-11-03T10:28:53Z)
HPC-Coder: Modeling Parallel Programs using Large Language Models [2.3101915391170573]
We show how large language models can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot.
arXiv Detail & Related papers (2023-06-29T19:44:55Z)
Composing Complex and Hybrid AI Solutions [52.00820391621739]
We describe an extension of the Acumos system towards enabling the above features for general AI applications. Our extensions include support for more generic components with gRPC/Protobuf interfaces. We provide examples of deployable solutions and their interfaces.
arXiv Detail & Related papers (2022-02-25T08:57:06Z)
YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open source platform for rapid development of computer vision applications. The platform puts the efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z)
Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum [55.6313942302582]
We propose a methodology to support the optimization of real-life applications on the Edge-to-Cloud Continuum. Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behaviour. Our methodology can be generalized to other applications in the Edge-to-Cloud Continuum.
arXiv Detail & Related papers (2021-08-04T07:35:14Z)
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs [0.2538209532048866]
This article provides the motivation and overview of the Collective Knowledge framework (CK or cKnowledge) The CK concept is to decompose research projects into reusable components that encapsulate research artifacts. The long-term goal is to accelerate innovation by connecting researchers and practitioners to share and reuse all their knowledge.
arXiv Detail & Related papers (2020-11-02T17:42:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.