Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale
- URL: http://arxiv.org/abs/2407.11967v1
- Date: Tue, 16 Jul 2024 17:59:46 GMT
- Title: Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale
- Authors: Aymen Alsaadi, Shantenu Jha, Matteo Turilli
- Abstract summary: Hydra is an intra/cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial/private cloud and HPC platforms.
- Score: 1.474723404975345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific discovery increasingly depends on middleware that enables the execution of heterogeneous workflows on heterogeneous platforms. One of the main challenges is to design software components that integrate within the existing ecosystem to enable scale and performance across cloud and high-performance computing (HPC) platforms. Researchers are met with a varied computing landscape, which includes services available on commercial cloud platforms, data and network capabilities specifically designed for scientific discovery on government-sponsored cloud platforms, and scale and performance on HPC platforms. We present Hydra, an intra/cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial/private cloud and HPC platforms, and managing the execution of heterogeneous workflow applications on those resources. This paper offers four main contributions: (1) the design of brokering capabilities in the presence of task, platform, resource, and middleware heterogeneity; (2) a reference implementation of that design with Hydra; (3) an experimental characterization of Hydra's overheads and strong/weak scaling with heterogeneous workloads and platforms; and (4) the implementation of a workflow that models sea rise with Hydra and its scaling on cloud and HPC platforms.
Related papers
- ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies [3.1805622006446397]
Scientific discovery increasingly requires executing heterogeneous workflows on high-performance computing platforms.
We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK).
The SDK is a collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms.
arXiv Detail & Related papers (2024-07-23T17:00:09Z) - Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence [79.5316642687565]
Existing multi-agent frameworks often struggle with integrating diverse capable third-party agents.
We propose the Internet of Agents (IoA), a novel framework that addresses these limitations.
IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control.
arXiv Detail & Related papers (2024-07-09T17:33:24Z) - One nine availability of a Photonic Quantum Computer on the Cloud toward HPC integration [0.8961191069175432]
In November 2022, we introduced the first cloud-accessible general-purpose quantum computer based on single photons.
We describe the design and implementation of our cloud-accessible quantum computing platform, and demonstrate one nine availability (92%) for external users during a six-month period, higher than most online services.
This work lays the foundation for advancing quantum computing accessibility and usability in hybrid HPC-QC infrastructures.
arXiv Detail & Related papers (2023-08-28T13:47:39Z) - A Transformer Framework for Data Fusion and Multi-Task Learning in Smart Cities [99.56635097352628]
This paper proposes a Transformer-based AI system for emerging smart cities.
It supports virtually any input data and output task types present in S&CCs.
It is demonstrated through learning diverse task sets representative of S&CC environments.
arXiv Detail & Related papers (2022-11-18T20:43:09Z) - YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open source platform for rapid development of computer vision applications.
The platform puts efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z) - Case study of SARS-CoV-2 transmission risk assessment in indoor environments using cloud computing resources [1.3150679728390269]
We showcase how a complex computational framework can be abstracted and deployed on cloud services.
We deploy the simulation framework on the Azure cloud framework, utilizing the Dendro-kT mesh generation tool and PETSc solvers.
We compare the performance of the cloud machines with that of the state-of-the-art HPC machine TACC Frontera.
arXiv Detail & Related papers (2021-11-17T19:28:18Z) - Secure Platform for Processing Sensitive Data on Shared HPC Systems [0.0]
High performance computing clusters pose challenges for processing sensitive data.
In this work we present a novel method for creating secure computing environments on traditional multi-tenant high-performance computing clusters.
arXiv Detail & Related papers (2021-03-26T18:30:33Z) - Power Modeling for Effective Datacenter Planning and Compute Management [53.41102502425513]
We discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable and applicable to all hardware configurations and workloads.
We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percent Error (MAPE) for more than 95% of diverse Power Distribution Units (more than 2000) using only 4 features.
arXiv Detail & Related papers (2021-03-22T21:22:51Z) - Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z) - Integrating Deep Learning in Domain Sciences at Exascale [2.241545093375334]
We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently.
We propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems.
We present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications with AI.
arXiv Detail & Related papers (2020-11-23T03:09:58Z) - Knowledge Integration of Collaborative Product Design Using Cloud Computing Infrastructure [65.2157099438235]
This paper presents ongoing research on providing a knowledge integration service for collaborative product design and development using cloud computing infrastructure.
The proposed knowledge integration services support users by giving real-time access to knowledge resources.
arXiv Detail & Related papers (2020-01-16T18:44:27Z)