Related papers: Metadata practices for simulation workflows

Metadata practices for simulation workflows

URL: http://arxiv.org/abs/2408.17309v1
Date: Fri, 30 Aug 2024 14:12:31 GMT
Title: Metadata practices for simulation workflows
Authors: Jose Villamar, Matthias Kelbling, Heather L. More, Michael Denker, Tom Tetzlaff, Johanna Senk, Stephan Thober,
Abstract summary: We present general practices for acquiring and handling metadata that are agnostic to software and hardware. These consist of two steps: 1) recording and storing raw metadata, and 2) selecting and structuring metadata. As a proof of concept, we develop the Archivist, a Python tool to help with the second step, and use it to apply our practices to distinct high-performance use cases from neuroscience and hydrology.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Computer simulations are an essential pillar of knowledge generation in science. Understanding, reproducing, and exploring the results of simulations relies on tracking and organizing metadata describing numerical experiments. However, the models used to understand real-world systems, and the computational machinery required to simulate them, are typically complex, and produce large amounts of heterogeneous metadata. Here, we present general practices for acquiring and handling metadata that are agnostic to software and hardware, and highly flexible for the user. These consist of two steps: 1) recording and storing raw metadata, and 2) selecting and structuring metadata. As a proof of concept, we develop the Archivist, a Python tool to help with the second step, and use it to apply our practices to distinct high-performance computing use cases from neuroscience and hydrology. Our practices and the Archivist can readily be applied to existing workflows without the need for substantial restructuring. They support sustainable numerical workflows, facilitating reproducibility and data reuse in generic simulation-based research.

Related papers

The Artificial Scientist -- in-transit Machine Learning of Plasma Simulations [33.024345484180024]
We demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework. With the presented workflow, data operations can be performed in common and easy-to-use programming languages.
arXiv Detail & Related papers (2025-01-06T20:58:27Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities.<n>It supports more critical tasks including data analysis, annotation, and foundation model post-training.<n>It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning [4.812580392361432]
Well is a large-scale collection of numerical simulations of a wide variety of physical systems.<n>These datasets can be used individually or as part of a broader benchmark suite.<n>We provide a unified PyTorch interface for training and evaluating models.
arXiv Detail & Related papers (2024-11-30T19:42:14Z)
Improving Radiography Machine Learning Workflows via Metadata Management for Training Data Selection [0.0]
In the physical sciences, there is an ever-increasing pool of metadata that is generated by the scientific research cycle. Tracking this metadata can reduce redundant work, improve, and aid in the feature and training dataset engineering process. We present a tool for machine learning metadata management in dynamic radiography.
arXiv Detail & Related papers (2024-08-22T18:01:21Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
Simulation-Based Parallel Training [55.41644538483948]
We present our ongoing work to design a training framework that alleviates those bottlenecks. It generates data in parallel with the training process. We present a strategy to mitigate this bias with a memory buffer.
arXiv Detail & Related papers (2022-11-08T09:31:25Z)
Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models. The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Synthetic Benchmarks for Scientific Research in Explainable Machine Learning [14.172740234933215]
We release XAI-Bench: a suite of synthetic datasets and a library for benchmarking feature attribution algorithms. Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values. We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers.
arXiv Detail & Related papers (2021-06-23T17:10:21Z)
Probabilistic Active Meta-Learning [15.432006404678981]
We introduce task selection based on prior experience into a meta-learning algorithm. We provide empirical evidence that our approach improves data-efficiency when compared to strong baselines on simulated robotic experiments.
arXiv Detail & Related papers (2020-07-17T12:51:42Z)
PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms [4.060731229044571]
We present a trace-driven simulation-based experimentation and analytics environment for large-scale AI systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. We implement the model in a standalone, discrete event simulator, and provide a toolkit for running experiments.
arXiv Detail & Related papers (2020-06-22T19:55:37Z)
Provable Meta-Learning of Linear Representations [114.656572506859]
We provide fast, sample-efficient algorithms to address the dual challenges of learning a common set of features from multiple, related tasks, and transferring this knowledge to new, unseen tasks. We also provide information-theoretic lower bounds on the sample complexity of learning these linear features.
arXiv Detail & Related papers (2020-02-26T18:21:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.