Exascale Workflow Applications and Middleware: An ExaWorks Retrospective
- URL: http://arxiv.org/abs/2411.10637v1
- Date: Sat, 16 Nov 2024 00:10:53 GMT
- Title: Exascale Workflow Applications and Middleware: An ExaWorks Retrospective
- Authors: Aymen Alsaadi, Mihael Hategan-Marandiuc, Ketan Maheshwari, Andre Merzky, Mikhail Titov, Matteo Turilli, Andreas Wilke, Justin M. Wozniak, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney,
- Abstract summary: We present the ExaWorks project, which addresses the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms.
We developed a workflow Software Development Toolkit (SDK) and PSI/J, a job management abstraction API that provides a minimal interface for submitting and monitoring jobs.
We discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
- Score: 3.4423220997316593
- Abstract: Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. However, these software combinations and integrations are difficult to achieve due to the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API, to simplify the construction of portable software components and applications that can be used over various HPC schedulers. The PSI/J API is a minimal interface for submitting and monitoring jobs and their execution state across multiple and commonly used HPC schedulers. We also describe several leading and innovative workflow examples of ExaWorks tools used on DOE leadership platforms. Furthermore, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
Related papers
- PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent.
From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content.
From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
- HyperGraphOS: A Modern Meta-Operating System for the Scientific and Engineering Domains [1.4469725791865982]
This paper presents HyperGraphOS, a significant innovation in the domain of operating systems.
It aims to combine model-based engineering, graph modeling, data containers, and documents, along with tools for handling computational elements.
arXiv Detail & Related papers (2024-12-13T15:18:39Z)
- WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We present WorkflowLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration.
It first constructs a large-scale fine-tuning dataset, WorkflowBench, with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories.
WorkflowLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z)
- ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies [3.1805622006446397]
Scientific discovery increasingly requires executing heterogeneous workflows on high-performance computing platforms.
We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK).
The SDK is a collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms.
arXiv Detail & Related papers (2024-07-23T17:00:09Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that enables LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with pass@1 rates of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
- Leveraging Large Language Models to Build and Execute Computational Workflows [40.572754656757475]
This paper explores how the emerging capabilities of large language models can be harnessed to facilitate complex scientific research.
We present initial findings from our attempt to integrate Phyloflow with OpenAI's function-calling API, and outline a strategy for developing a comprehensive workflow management system.
arXiv Detail & Related papers (2023-12-12T20:17:13Z)
- Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT [11.410608233274942]
Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets.
However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution.
We investigate the efficiency of Large Language Models, specifically ChatGPT, in supporting users who work with scientific workflows.
arXiv Detail & Related papers (2023-11-03T10:28:53Z)
- Composing Complex and Hybrid AI Solutions [52.00820391621739]
We describe an extension of the Acumos system towards enabling the above features for general AI applications.
Our extensions include support for more generic components with gRPC/Protobuf interfaces.
We provide examples of deployable solutions and their interfaces.
arXiv Detail & Related papers (2022-02-25T08:57:06Z)
- YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open source platform for rapid development of computer vision applications.
The platform puts efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z)
- Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs [0.2538209532048866]
This article provides the motivation and overview of the Collective Knowledge framework (CK or cKnowledge).
The CK concept is to decompose research projects into reusable components that encapsulate research artifacts.
The long-term goal is to accelerate innovation by connecting researchers and practitioners to share and reuse all their knowledge.
arXiv Detail & Related papers (2020-11-02T17:42:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.