Towards Lightweight Data Integration using Multi-workflow Provenance and
Data Observability
- URL: http://arxiv.org/abs/2308.09004v1
- Date: Thu, 17 Aug 2023 14:20:29 GMT
- Title: Towards Lightweight Data Integration using Multi-workflow Provenance and
Data Observability
- Authors: Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov,
Rafael Ferreira da Silva
- Abstract summary: Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era.
We propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis.
We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large-scale scientific discovery requires multidisciplinary
collaboration across diverse computing facilities, including High Performance
Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data
analysis plays a crucial role in scientific discovery, especially in the
current AI era, by enabling Responsible AI development, FAIR, Reproducibility,
and User Steering. However, the heterogeneous nature of science poses
challenges such as dealing with multiple supporting tools, cross-facility
environments, and efficient HPC execution. Building on data observability,
adapter system design, and provenance, we propose MIDA: an approach for
lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data
observability strategies and adaptability methods for various parallel systems
and machine learning tools. With observability, it intercepts the dataflows in
the background without requiring instrumentation while integrating domain,
provenance, and telemetry data at runtime into a unified database ready for
user steering queries. We conduct experiments showing end-to-end multi-workflow
analysis integrating data from Dask and MLFlow in a real distributed deep
learning use case for materials science that runs on multiple environments with
up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000
tasks on 1,680 CPU cores on the Summit supercomputer.
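The integration idea in the abstract (per-tool adapters feeding domain, provenance, and telemetry data into one queryable database) can be illustrated with a minimal sketch. This is not MIDA's implementation: the event formats, the `adapt_scheduler_event` and `adapt_tracker_event` functions, the `UnifiedStore` class, and the table schema are all hypothetical, and SQLite stands in for whatever database the authors use. In MIDA the events are intercepted in the background; here they are passed in directly to keep the example self-contained.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Common row shape every adapter must produce (hypothetical schema):
# (source, task_id, activity, used_json, generated_json, elapsed_s)
Row = Tuple[str, str, str, str, str, float]


def adapt_scheduler_event(event: Dict[str, Any]) -> Row:
    """Adapter for a hypothetical parallel-scheduler event format."""
    return (
        "scheduler",
        event["key"],
        event["function"],
        json.dumps(event["args"]),       # domain data consumed by the task
        json.dumps(event["result"]),     # domain data produced by the task
        event["stop"] - event["start"],  # telemetry: wall-clock duration
    )


def adapt_tracker_event(event: Dict[str, Any]) -> Row:
    """Adapter for a hypothetical ML experiment-tracker event format."""
    return (
        "tracker",
        event["run_id"],
        "training_run",
        json.dumps(event["params"]),
        json.dumps(event["metrics"]),
        event["duration"],
    )


class UnifiedStore:
    """Single table holding records from every observed tool."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE tasks (source TEXT, task_id TEXT, activity TEXT,"
            " used TEXT, generated TEXT, elapsed_s REAL)"
        )

    def ingest(
        self,
        events: Iterable[Dict[str, Any]],
        adapter: Callable[[Dict[str, Any]], Row],
    ) -> None:
        """Normalize tool-specific events with the adapter and store them."""
        self.db.executemany(
            "INSERT INTO tasks VALUES (?, ?, ?, ?, ?, ?)",
            (adapter(e) for e in events),
        )

    def summary_by_source(self) -> List[Tuple[str, int, float]]:
        """Cross-tool query: task count and total time per source."""
        cur = self.db.execute(
            "SELECT source, COUNT(*), SUM(elapsed_s) FROM tasks"
            " GROUP BY source ORDER BY source"
        )
        return cur.fetchall()


if __name__ == "__main__":
    store = UnifiedStore()
    store.ingest(
        [{"key": "t1", "function": "preprocess", "args": [1],
          "result": 2, "start": 10.0, "stop": 12.5}],
        adapt_scheduler_event,
    )
    store.ingest(
        [{"run_id": "r1", "params": {"lr": 0.1},
          "metrics": {"loss": 0.3}, "duration": 4.0}],
        adapt_tracker_event,
    )
    print(store.summary_by_source())
```

Keeping one adapter per tool means a new source only needs a function that maps its events to the common row shape; the store and the queries stay unchanged.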
Related papers
- Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security [5.781151161558928]
Methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets fail to achieve scientific requirements for performance, energy, security, and reliability.
This report describes the results and successes of CHESS from the perspective of open science.
arXiv Detail & Related papers (2024-10-21T15:16:00Z)
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research [5.093676641214663]
We introduce EDALearn, the first holistic, open-source benchmark suite specifically for Machine Learning tasks in EDA.
This benchmark suite presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages.
Our contributions aim to encourage further advances in the ML-EDA domain.
arXiv Detail & Related papers (2023-12-04T06:51:46Z)
- Multi-Fidelity Active Learning with GFlowNets [65.91555804996203]
We propose a multi-fidelity active learning algorithm with GFlowNets as a sampler, to efficiently discover diverse, high-scoring candidates.
Our evaluation on molecular discovery tasks shows that multi-fidelity active learning with GFlowNets can discover high-scoring candidates at a fraction of the budget of its single-fidelity counterpart.
arXiv Detail & Related papers (2023-06-20T17:43:42Z)
- Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review [62.997667081978825]
This review aims at providing a comprehensive vision of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today.
The main simulation, emulation, deployment systems, and testbeds for experimental research on the Edge-to-Cloud Continuum available today are also surveyed.
arXiv Detail & Related papers (2022-04-29T08:06:05Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
- PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms [4.060731229044571]
We present a trace-driven simulation-based experimentation and analytics environment for large-scale AI systems.
Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model.
We implement the model in a standalone, discrete event simulator, and provide a toolkit for running experiments.
arXiv Detail & Related papers (2020-06-22T19:55:37Z)