Workflow Provenance in the Lifecycle of Scientific Machine Learning
- URL: http://arxiv.org/abs/2010.00330v3
- Date: Wed, 25 Aug 2021 14:26:33 GMT
- Title: Workflow Provenance in the Lifecycle of Scientific Machine Learning
- Authors: Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares,
Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil,
Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A.
S. Netto
- Abstract summary: We leverage workflow techniques to build a holistic view to support the lifecycle of scientific ML.
We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs.
- Score: 1.6118907823528272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning (ML) has already fundamentally changed several businesses.
More recently, it has also been profoundly impacting the computational science
and engineering domains, like geoscience, climate science, and health science.
In these domains, users need to perform comprehensive data analyses combining
scientific data and ML models to provide for critical requirements, such as
reproducibility, model explainability, and experiment data understanding.
However, scientific ML is multidisciplinary, heterogeneous, and affected by the
physical constraints of the domain, making such analyses even more challenging.
In this work, we leverage workflow provenance techniques to build a holistic
view to support the lifecycle of scientific ML. We contribute with (i)
characterization of the lifecycle and taxonomy for data analyses; (ii) design
principles to build this view, with a W3C PROV compliant data representation
and a reference system architecture; and (iii) lessons learned after an
evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946
GPUs. The experiments show that the principles enable queries that integrate
domain semantics with ML models while keeping low overhead (<1%), high
scalability, and an order-of-magnitude query acceleration under certain
workloads compared with not using our representation.
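As an illustration of what a W3C PROV-style data representation makes possible, the following is a minimal sketch, not the authors' system. It uses only the standard PROV relation names `used` and `wasGeneratedBy`; the class `ProvGraph`, its methods, and the example identifiers (`seismic_raw`, `preprocess`, `training_set`, `train`, `model`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    """Toy in-memory store for PROV-style entities, activities, and relations."""
    entities: dict = field(default_factory=dict)
    activities: dict = field(default_factory=dict)
    used: list = field(default_factory=list)              # (activity_id, entity_id)
    was_generated_by: list = field(default_factory=list)  # (entity_id, activity_id)

    def add_entity(self, entity_id, **attrs):
        self.entities[entity_id] = attrs

    def add_activity(self, activity_id, **attrs):
        self.activities[activity_id] = attrs

    def record_used(self, activity_id, entity_id):
        self.used.append((activity_id, entity_id))

    def record_generated(self, entity_id, activity_id):
        self.was_generated_by.append((entity_id, activity_id))

    def lineage(self, entity_id):
        """Return the ids of all upstream entities that entity_id depends on."""
        seen = set()
        stack = [entity_id]
        while stack:
            current = stack.pop()
            for ent, act in self.was_generated_by:
                if ent != current:
                    continue
                # Every entity used by the generating activity is upstream.
                for used_act, source in self.used:
                    if used_act == act and source not in seen:
                        seen.add(source)
                        stack.append(source)
        return seen


# Hypothetical two-step workflow: raw domain data -> training set -> ML model.
graph = ProvGraph()
graph.add_entity("seismic_raw", domain="geoscience")
graph.add_entity("training_set")
graph.add_entity("model", kind="ml_model")
graph.add_activity("preprocess")
graph.add_activity("train")
graph.record_used("preprocess", "seismic_raw")
graph.record_generated("training_set", "preprocess")
graph.record_used("train", "training_set")
graph.record_generated("model", "train")

upstream = graph.lineage("model")
```

Walking the `wasGeneratedBy` and `used` relations backwards from the model reaches the domain data it was derived from, which is the kind of query that joins domain semantics with ML artifacts.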
Related papers
- MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLM) have shown strong capabilities in general visual reasoning tasks.
We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS) based on an MLLM.
MAPS decomposes the expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z)
- Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries [23.111935712144277]
Rapid yet accurate simulations of fluid dynamics around complex geometries are critical in a variety of engineering and scientific applications.
While scientific machine learning (SciML) has shown promise, most studies are constrained to simple geometries.
This study addresses this gap by benchmarking diverse SciML models for fluid flow prediction over intricate geometries.
arXiv Detail & Related papers (2024-12-31T00:23:15Z)
- Data-Efficient Inference of Neural Fluid Fields via SciML Foundation Model [49.06911227670408]
We show that a SciML foundation model can significantly improve the data efficiency of inferring real-world 3D fluid dynamics with improved generalization.
We equip neural fluid fields with a novel collaborative training approach that utilizes augmented views and fluid features extracted by our foundation model.
arXiv Detail & Related papers (2024-12-18T14:39:43Z)
- Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics [1.1510009152620668]
An underexplored aspect of machine learning is how to develop minimally-optimal representations.
Our own view is that ML-based modeling should be based on the use of computational units that are fundamentally interpretable by design.
We show, in the context of lumped catchment modeling, that physical interpretability and predictive performance can both be achieved using a relatively parsimonious distributed-state network.
arXiv Detail & Related papers (2024-12-06T08:30:01Z)
- Recent Advances on Machine Learning for Computational Fluid Dynamics: A Survey [51.87875066383221]
This paper introduces fundamental concepts, traditional methods, and benchmark datasets, then examines the various roles Machine Learning plays in improving CFD.
We highlight real-world applications of ML for CFD in critical scientific and engineering disciplines, including aerodynamics, combustion, atmosphere & ocean science, biology fluid, plasma, symbolic regression, and reduced order modeling.
We draw the conclusion that ML is poised to significantly transform CFD research by enhancing simulation accuracy, reducing computational time, and enabling more complex analyses of fluid dynamics.
arXiv Detail & Related papers (2024-08-22T07:33:11Z)
- Improving Molecular Modeling with Geometric GNNs: an Empirical Study [56.52346265722167]
This paper focuses on the impact of (1) different canonicalization methods, (2) graph creation strategies, and (3) auxiliary tasks, on performance, scalability and symmetry enforcement.
Our findings and insights aim to guide researchers in selecting optimal modeling components for molecular modeling tasks.
arXiv Detail & Related papers (2024-07-11T09:04:12Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Opportunities for machine learning in scientific discovery [16.526872562935463]
We review how the scientific community can increasingly leverage machine-learning techniques to achieve scientific discoveries.
Although challenges remain, principled use of ML is opening up new avenues for fundamental scientific discoveries.
arXiv Detail & Related papers (2024-05-07T09:58:02Z)
- Machine Learning in Nano-Scale Biomedical Engineering [77.75587007080894]
We review the existing research regarding the use of machine learning in nano-scale biomedical engineering.
The main challenges that can be formulated as ML problems are classified into three main categories.
For each of the presented methodologies, special emphasis is given to its principles, applications, and limitations.
arXiv Detail & Related papers (2020-08-05T15:45:54Z)
- Complete CVDL Methodology for Investigating Hydrodynamic Instabilities [0.49873153106566565]
In fluid dynamics, one of the most important research fields is hydrodynamic instabilities and their evolution in different flow regimes.
Currently, three main methods are used for understanding such phenomena, namely analytical models, experiments and simulations.
We claim and demonstrate that a major portion of this research effort could and should be analysed using recent breakthrough advancements in the field of Computer Vision with Deep Learning (CVDL, or Deep Computer-Vision).
Specifically, we focus in this research on one of the most representative instabilities, the Rayleigh-Taylor one, simulate its behaviour and create an open-sourced state-of-the
arXiv Detail & Related papers (2020-04-03T13:52:04Z)