ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies
- URL: http://arxiv.org/abs/2407.16646v1
- Date: Tue, 23 Jul 2024 17:00:09 GMT
- Title: ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies
- Authors: Matteo Turilli, Mihael Hategan-Marandiuc, Mikhail Titov, Ketan Maheshwari, Aymen Alsaadi, Andre Merzky, Ramon Arambula, Mikhail Zakharchanka, Matt Cowan, Justin M. Wozniak, Andreas Wilke, Ozgur Ozan Kilic, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney
- Abstract summary: Scientific discovery increasingly requires executing heterogeneous workflows on high-performance computing platforms.
We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK).
The SDK is a collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms.
- Score: 3.1805622006446397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. That requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices of software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of scientific heterogeneous workflows on the newly available exascale HPC platforms.
Related papers
- Exascale Workflow Applications and Middleware: An ExaWorks Retrospective [3.4423220997316593]
We present the ExaWorks project, which addresses the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms.
We developed a workflow Software Development Toolkit (SDK), a job management abstraction API, and PSI/J, a minimal interface for submitting and monitoring jobs.
We discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of sustainability at the exascale.
arXiv Detail & Related papers (2024-11-16T00:10:53Z)
- Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security [5.781151161558928]
Methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets fail to achieve scientific requirements for performance, energy, security, and reliability.
This report describes the results and successes of CHESS from the perspective of open science.
arXiv Detail & Related papers (2024-10-21T15:16:00Z)
- GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks.
We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models.
The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
- Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale [1.474723404975345]
Hydra is an intra cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial private cloud and HPC platforms.
arXiv Detail & Related papers (2024-07-16T17:59:46Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- Reusability Challenges of Scientific Workflows: A Case Study for Galaxy [56.78572674167333]
This study examined the reusability of existing workflows and exposed several challenges.
The challenges preventing reusability include tool upgrading, tool support, design flaws, incomplete workflows, failure to load a workflow, etc.
arXiv Detail & Related papers (2023-09-13T20:17:43Z)
- The GitHub Development Workflow Automation Ecosystems [47.818229204130596]
Large-scale software development has become a highly collaborative endeavour.
This chapter explores the ecosystems of development bots and GitHub Actions.
It provides an extensive survey of the state-of-the-art in this domain.
arXiv Detail & Related papers (2023-05-08T15:24:23Z)
- YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open source platform for rapid development of computer vision applications.
The platform puts efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z)
- hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices [0.6353764569103648]
In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries.
We have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms.
We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations.
arXiv Detail & Related papers (2021-03-09T17:34:44Z)
- Technology Readiness Levels for Machine Learning Systems [107.56979560568232]
Development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end.
We have developed a proven systems engineering approach for machine learning development and deployment.
Our "Machine Learning Technology Readiness Levels" framework defines a principled process to ensure robust, reliable, and responsible systems.
arXiv Detail & Related papers (2021-01-11T15:54:48Z)
- Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs [0.2538209532048866]
This article provides the motivation and overview of the Collective Knowledge framework (CK or cKnowledge).
The CK concept is to decompose research projects into reusable components that encapsulate research artifacts.
The long-term goal is to accelerate innovation by connecting researchers and practitioners to share and reuse all their knowledge.
arXiv Detail & Related papers (2020-11-02T17:42:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.