Exascale Workflow Applications and Middleware: An ExaWorks Retrospective
- URL: http://arxiv.org/abs/2411.10637v1
- Date: Sat, 16 Nov 2024 00:10:53 GMT
- Title: Exascale Workflow Applications and Middleware: An ExaWorks Retrospective
- Authors: Aymen Alsaadi, Mihael Hategan-Marandiuc, Ketan Maheshwari, Andre Merzky, Mikhail Titov, Matteo Turilli, Andreas Wilke, Justin M. Wozniak, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney,
- Abstract summary: We present the ExaWorks project, which addresses the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms.
We developed a workflow Software Development Toolkit (SDK) and PSI/J, a job management abstraction API that provides a minimal interface for submitting and monitoring jobs.
We discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
- Score: 3.4423220997316593
- License:
- Abstract: Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. However, these software combinations and integrations are difficult to achieve due to the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API, to simplify the construction of portable software components and applications that can be used over various HPC schedulers. The PSI/J API is a minimal interface for submitting and monitoring jobs and their execution state across multiple and commonly used HPC schedulers. We also describe several leading and innovative workflow examples of ExaWorks tools used on DOE leadership platforms. Furthermore, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
Related papers
- WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We present WorkflowLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration.
It first constructs WorkflowBench, a large-scale fine-tuning dataset with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories.
WorkflowLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z)
- GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks.
We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models.
The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
- ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies [3.1805622006446397]
Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing platforms.
We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK).
The SDK is a collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms.
arXiv Detail & Related papers (2024-07-23T17:00:09Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that enables LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
- Leveraging Large Language Models to Build and Execute Computational Workflows [40.572754656757475]
This paper explores how the emerging capabilities of large language models can be harnessed to facilitate complex scientific research.
We present initial findings from our attempt to integrate Phyloflow with OpenAI's function-calling API, and outline a strategy for developing a comprehensive workflow management system.
arXiv Detail & Related papers (2023-12-12T20:17:13Z)
- Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT [11.410608233274942]
Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets.
However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution.
We investigate the efficiency of Large Language Models, specifically ChatGPT, to support users when dealing with scientific workflows.
arXiv Detail & Related papers (2023-11-03T10:28:53Z)
- HPC-Coder: Modeling Parallel Programs using Large Language Models [2.3101915391170573]
We show how large language models can be applied to tasks specific to high performance and scientific codes.
We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models.
In our experiments, we show that this model can auto-complete HPC functions where generic models cannot.
arXiv Detail & Related papers (2023-06-29T19:44:55Z)
- Composing Complex and Hybrid AI Solutions [52.00820391621739]
We describe an extension of the Acumos system towards enabling the above features for general AI applications.
Our extensions include support for more generic components with gRPC/Protobuf interfaces.
We provide examples of deployable solutions and their interfaces.
arXiv Detail & Related papers (2022-02-25T08:57:06Z)
- YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open source platform for rapid development of computer vision applications.
The platform puts the efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z)
- Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs [0.2538209532048866]
This article provides the motivation for and an overview of the Collective Knowledge framework (CK or cKnowledge).
The CK concept is to decompose research projects into reusable components that encapsulate research artifacts.
The long-term goal is to accelerate innovation by connecting researchers and practitioners to share and reuse all their knowledge.
arXiv Detail & Related papers (2020-11-02T17:42:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.