Workflow Provenance in the Lifecycle of Scientific Machine Learning
- URL: http://arxiv.org/abs/2010.00330v3
- Date: Wed, 25 Aug 2021 14:26:33 GMT
- Title: Workflow Provenance in the Lifecycle of Scientific Machine Learning
- Authors: Renan Souza, Leonardo G. Azevedo, V\'itor Louren\c{c}o, Elton Soares,
Raphael Thiago, Rafael Brand\~ao, Daniel Civitarese, Emilio Vital Brazil,
Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A.
S. Netto
- Abstract summary: We leverage workflow techniques to build a holistic view to support the lifecycle of scientific ML.
We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs.
- Score: 1.6118907823528272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning (ML) has already fundamentally changed several businesses.
More recently, it has also been profoundly impacting the computational science
and engineering domains, like geoscience, climate science, and health science.
In these domains, users need to perform comprehensive data analyses combining
scientific data and ML models to provide for critical requirements, such as
reproducibility, model explainability, and experiment data understanding.
However, scientific ML is multidisciplinary, heterogeneous, and affected by the
physical constraints of the domain, making such analyses even more challenging.
In this work, we leverage workflow provenance techniques to build a holistic
view to support the lifecycle of scientific ML. We contribute with (i)
characterization of the lifecycle and taxonomy for data analyses; (ii) design
principles to build this view, with a W3C PROV compliant data representation
and a reference system architecture; and (iii) lessons learned after an
evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946
GPUs. The experiments show that the principles enable queries that integrate
domain semantics with ML models while keeping low overhead (<1%), high
scalability, and an order of magnitude of query acceleration under certain
workloads against without our representation.
Related papers
- Recent Advances on Machine Learning for Computational Fluid Dynamics: A Survey [51.87875066383221]
This paper introduces fundamental concepts, traditional methods, and benchmark datasets, then examine the various roles Machine Learning plays in improving CFD.
We highlight real-world applications of ML for CFD in critical scientific and engineering disciplines, including aerodynamics, combustion, atmosphere & ocean science, biology fluid, plasma, symbolic regression, and reduced order modeling.
We draw the conclusion that ML is poised to significantly transform CFD research by enhancing simulation accuracy, reducing computational time, and enabling more complex analyses of fluid dynamics.
arXiv Detail & Related papers (2024-08-22T07:33:11Z) - Improving Molecular Modeling with Geometric GNNs: an Empirical Study [56.52346265722167]
This paper focuses on the impact of different canonicalization methods, (2) graph creation strategies, and (3) auxiliary tasks, on performance, scalability and symmetry enforcement.
Our findings and insights aim to guide researchers in selecting optimal modeling components for molecular modeling tasks.
arXiv Detail & Related papers (2024-07-11T09:04:12Z) - Opportunities for machine learning in scientific discovery [16.526872562935463]
We review how the scientific community can increasingly leverage machine-learning techniques to achieve scientific discoveries.
Although challenges remain, principled use of ML is opening up new avenues for fundamental scientific discoveries.
arXiv Detail & Related papers (2024-05-07T09:58:02Z) - ML4PhySim : Machine Learning for Physical Simulations Challenge (The
airfoil design) [16.140736542578562]
The aim of this competition is to encourage the development of new ML techniques to solve physical problems.
We propose learning a task representing the airfoil design simulation, using a dataset called AirfRANS.
To the best of our knowledge, this is the first competition addressing the use of ML-based surrogate approaches to improve the trade-off computational cost/accuracy of physical simulation.
arXiv Detail & Related papers (2024-03-03T22:10:21Z) - MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z) - A Mass-Conserving-Perceptron for Machine Learning-Based Modeling of Geoscientific Systems [1.1510009152620668]
We propose a physically-interpretable Mass Conserving Perceptron (MCP) as a way to bridge the gap between PC-based and ML-based modeling approaches.
The MCP exploits the inherent isomorphism between the directed graph structures underlying both PC models and GRNNs to explicitly represent the mass-conserving nature of physical processes.
arXiv Detail & Related papers (2023-10-12T18:09:33Z) - Discovering Interpretable Physical Models using Symbolic Regression and
Discrete Exterior Calculus [55.2480439325792]
We propose a framework that combines Symbolic Regression (SR) and Discrete Exterior Calculus (DEC) for the automated discovery of physical models.
DEC provides building blocks for the discrete analogue of field theories, which are beyond the state-of-the-art applications of SR to physical problems.
We prove the effectiveness of our methodology by re-discovering three models of Continuum Physics from synthetic experimental data.
arXiv Detail & Related papers (2023-10-10T13:23:05Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z) - Machine Learning in Nano-Scale Biomedical Engineering [77.75587007080894]
We review the existing research regarding the use of machine learning in nano-scale biomedical engineering.
The main challenges that can be formulated as ML problems are classified into the three main categories.
For each of the presented methodologies, special emphasis is given to its principles, applications, and limitations.
arXiv Detail & Related papers (2020-08-05T15:45:54Z) - Complete CVDL Methodology for Investigating Hydrodynamic Instabilities [0.49873153106566565]
In fluid dynamics, one of the most important research fields is hydrodynamic instabilities and their evolution in different flow regimes.
Currently, three main methods are used for understanding such phenomenon - namely analytical models, experiments and simulations.
We claim and demonstrate that a major portion of this research effort could and should be analysed using recent breakthrough advancements in the field of Computer Vision with Deep Learning (CVDL, or Deep Computer-Vision)
Specifically, we focus in this research on one of the most representative instabilities, the Rayleigh-Taylor one, simulate its behaviour and create an open-sourced state-of-the
arXiv Detail & Related papers (2020-04-03T13:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.