Related papers: Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale

URL: http://arxiv.org/abs/2407.11967v1
Date: Tue, 16 Jul 2024 17:59:46 GMT
Title: Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale
Authors: Aymen Alsaadi, Shantenu Jha, Matteo Turilli
Abstract summary: Hydra is an intra/cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial/private cloud and HPC platforms.
Score: 1.474723404975345
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scientific discovery increasingly depends on middleware that enables the execution of heterogeneous workflows on heterogeneous platforms. One of the main challenges is to design software components that integrate within the existing ecosystem to enable scale and performance across cloud and high-performance computing (HPC) platforms. Researchers are met with a varied computing landscape, which includes services available on commercial cloud platforms, data and network capabilities specifically designed for scientific discovery on government-sponsored cloud platforms, and scale and performance on HPC platforms. We present Hydra, an intra/cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial/private cloud and HPC platforms and managing the execution of heterogeneous workflow applications on those resources. This paper offers four main contributions: (1) the design of brokering capabilities in the presence of task, platform, resource, and middleware heterogeneity; (2) a reference implementation of that design with Hydra; (3) an experimental characterization of Hydra's overheads and strong/weak scaling with heterogeneous workloads and platforms; and (4) the implementation of a workflow that models sea rise with Hydra and its scaling on cloud and HPC platforms.

Related papers

Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications [2.0999841017238063]
Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out- HPC. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads.
arXiv Detail & Related papers (2025-03-17T16:21:48Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
Transforming the Hybrid Cloud for Emerging AI Workloads [81.15269563290326]
This white paper envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads. The proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms.
arXiv Detail & Related papers (2024-11-20T11:57:43Z)
Exascale Workflow Applications and Middleware: An ExaWorks Retrospective [3.4423220997316593]
We present the ExaWorks project, which addresses the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We developed a workflow Software Development Toolkit (SDK), a job management abstraction API, and PSI/J, a minimal interface for submitting and monitoring jobs. We discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of sustainably at the exascale.
arXiv Detail & Related papers (2024-11-16T00:10:53Z)
GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use workflows to enhance performance and to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning [49.997801914237094]
We introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework, and its best practices. For Deep Learning (DL) training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication.
arXiv Detail & Related papers (2024-08-26T10:11:56Z)
ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies [3.1805622006446397]
Scientific discovery increasingly requires executing heterogeneous workflows on high-performance computing platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms.
arXiv Detail & Related papers (2024-07-23T17:00:09Z)
One nine availability of a Photonic Quantum Computer on the Cloud toward HPC integration [0.8961191069175432]
In November 2022, we introduced the first cloud-accessible general-purpose quantum computer based on single photons. We describe the design and implementation of our cloud-accessible quantum computing platform, and demonstrate one nine availability (92%) for external users during a six-month period, higher than most online services. This work lays the foundation for advancing quantum computing accessibility and usability in hybrid HPC-QC infrastructures.
arXiv Detail & Related papers (2023-08-28T13:47:39Z)
A Transformer Framework for Data Fusion and Multi-Task Learning in Smart Cities [99.56635097352628]
This paper proposes a Transformer-based AI system for emerging smart cities. It supports virtually any input data and output task types present in S&CCs. It is demonstrated through learning diverse task sets representative of S&CC environments.
arXiv Detail & Related papers (2022-11-18T20:43:09Z)
YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open-source platform for rapid development of computer vision applications. The platform puts efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z)
Secure Platform for Processing Sensitive Data on Shared HPC Systems [0.0]
High performance computing clusters pose challenges for processing sensitive data. In this work we present a novel method for creating secure computing environments on traditional multi-tenant high-performance computing clusters.
arXiv Detail & Related papers (2021-03-26T18:30:33Z)
Power Modeling for Effective Datacenter Planning and Compute Management [53.41102502425513]
We discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable, and applicable to all hardware configurations and workloads. We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percent Error (MAPE) for more than 95% of diverse Power Distribution Units (more than 2000) using only 4 features.
arXiv Detail & Related papers (2021-03-22T21:22:51Z)
Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments. It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)
Integrating Deep Learning in Domain Sciences at Exascale [2.241545093375334]
We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently. We propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems. We present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications with AI.
arXiv Detail & Related papers (2020-11-23T03:09:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.