SimbaML: Connecting Mechanistic Models and Machine Learning with
Augmented Data
- URL: http://arxiv.org/abs/2304.04000v2
- Date: Sun, 9 Jul 2023 16:10:55 GMT
- Title: SimbaML: Connecting Mechanistic Models and Machine Learning with
Augmented Data
- Authors: Maximilian Kleissl, Lukas Drews, Benedict B. Heyder, Julian Zabbarov,
Pascal Iversen, Simon Witzke, Bernhard Y. Renard, Katharina Baum
- Abstract summary: SimbaML is an open-source tool that unifies realistic synthetic dataset generation from ordinary differential equation-based models.
SimbaML conveniently enables investigating transfer learning from synthetic to real-world data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training sophisticated machine learning (ML) models requires large datasets
that are difficult or expensive to collect for many applications. If prior
knowledge about system dynamics is available, mechanistic representations can
be used to supplement real-world data. We present SimbaML (Simulation-Based
ML), an open-source tool that unifies realistic synthetic dataset generation
from ordinary differential equation-based models and the direct analysis and
inclusion in ML pipelines. SimbaML conveniently enables investigating transfer
learning from synthetic to real-world data, data augmentation, identifying
needs for data collection, and benchmarking physics-informed ML approaches.
SimbaML is available from https://pypi.org/project/simba-ml/.
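SimbaML's own API is not shown here; the following is a generic sketch of the workflow the abstract describes (simulate an ODE-based model, add observation noise, feed the result into an ML pipeline), built directly on scipy and scikit-learn. The logistic-growth model, noise level, and supervised task are illustrative assumptions, not part of SimbaML.

```python
# Sketch of ODE-based synthetic data generation for ML (not the SimbaML API):
# simulate trajectories of a mechanistic model, add observation noise, and
# train an ML model on the resulting synthetic dataset.
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.ensemble import RandomForestRegressor

def logistic(t, y, r=0.5, k=100.0):
    # Logistic growth: dy/dt = r * y * (1 - y / K)
    return r * y * (1.0 - y / k)

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 20.0, 200)

# Simulate several trajectories from random initial conditions.
X, y = [], []
for _ in range(50):
    y0 = rng.uniform(1.0, 10.0)
    sol = solve_ivp(logistic, (0.0, 20.0), [y0], t_eval=t_eval)
    traj = sol.y[0] + rng.normal(0.0, 1.0, size=t_eval.size)  # observation noise
    # Supervised task: predict the next observation from (time, current value).
    X.append(np.column_stack([t_eval[:-1], traj[:-1]]))
    y.append(traj[1:])
X, y = np.concatenate(X), np.concatenate(y)

# The synthetic dataset can now feed any standard ML pipeline.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
```

A real-world dataset could then be used to evaluate or fine-tune `model`, which is the synthetic-to-real transfer setting the abstract mentions.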
Related papers
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models [1.204452887718077]
We show how data management tools can significantly improve the quality of data that is used for machine learning (ML) applications.
We propose an architecture and implementation of such tools and demonstrate through two use cases how they can be used to improve ML-based eScience investigations.
arXiv Detail & Related papers (2024-06-27T04:42:29Z)
- Verbalized Machine Learning: Revisiting Machine Learning with Language Models [63.10391314749408]
We introduce the framework of verbalized machine learning (VML).
VML constrains the parameter space to be human-interpretable natural language.
We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability.
arXiv Detail & Related papers (2024-06-06T17:59:56Z)
- Harnessing Large Language Models as Post-hoc Correctors [6.288056740658763]
We show that an LLM can work as a post-hoc corrector to propose corrections for the predictions of an arbitrary Machine Learning model.
We form a contextual knowledge database by incorporating the dataset's label information and the ML model's predictions on the validation dataset.
Our experimental results on text analysis and challenging molecular predictions show that the proposed method improves the performance of a number of models by up to 39%.
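The post-hoc correction idea can be illustrated generically: the paper prompts an LLM with a contextual knowledge database built from validation labels and the base model's predictions, but in the sketch below a small regression model stands in for the LLM. All models, data, and functions here are illustrative assumptions, not the paper's setup.

```python
# Sketch of post-hoc correction: a second model learns to fix the errors of a
# frozen base model from validation data (a simple regressor stands in for
# the LLM corrector used in the paper).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=600)
X_train, X_val, X_test = X[:200], X[200:400], X[400:]
y_train, y_val, y_test = y[:200], y[200:400], y[400:]

# Base model: deliberately underfits the sinusoidal target.
base = LinearRegression().fit(X_train, y_train)

# "Knowledge base": validation features, base predictions, and true labels;
# the corrector learns the base model's residuals from it.
val_pred = base.predict(X_val)
corrector = DecisionTreeRegressor(max_depth=5, random_state=0).fit(
    np.column_stack([X_val, val_pred]), y_val - val_pred)

# At test time, add the predicted correction to the base prediction.
test_pred = base.predict(X_test)
corrected = test_pred + corrector.predict(np.column_stack([X_test, test_pred]))
err_base = float(np.mean((test_pred - y_test) ** 2))
err_corr = float(np.mean((corrected - y_test) ** 2))
```

On this toy problem the corrected predictions have a lower test error than the base model alone, which is the effect the paper reports at a much larger scale.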
arXiv Detail & Related papers (2024-02-20T22:50:41Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Closing the loop: Autonomous experiments enabled by machine-learning-based online data analysis in synchrotron beamline environments [80.49514665620008]
Machine learning can be used to enhance research involving large or rapidly generated datasets.
In this study, we describe the incorporation of ML into a closed-loop workflow for X-ray reflectometry (XRR)
We present solutions that provide elementary data analysis in real time during the experiment without introducing additional software dependencies into the beamline control software environment.
arXiv Detail & Related papers (2023-06-20T21:21:19Z)
- VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data [0.0]
This paper introduces VeML, a version management system dedicated to the end-to-end machine learning lifecycle.
We address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets.
We design an algorithm based on the core set to compute similarity for large-scale, high-dimensional data efficiently.
arXiv Detail & Related papers (2023-04-25T07:32:16Z)
- Synthetic data enable experiments in atomistic machine learning [0.0]
We demonstrate the use of a large dataset labelled with per-atom energies from an existing ML potential model.
The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints.
We show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets.
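The pre-train-on-synthetic-labels, fine-tune-on-real-labels recipe can be sketched on toy data. Here scikit-learn's `MLPRegressor` with `warm_start=True` stands in for an atomistic model, and the cheap surrogate and ground-truth labelling functions are invented for illustration; none of this is the paper's actual code.

```python
# Sketch of synthetic pre-training followed by fine-tuning: train on cheap,
# model-generated labels first, then continue training on a small set of
# expensive "ground truth" labels.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def ground_truth(x):     # expensive reference labels (e.g. quantum-mechanical)
    return np.sin(3 * x) * x

def cheap_surrogate(x):  # imperfect but cheap labeller (e.g. an existing ML potential)
    return np.sin(3 * x) * x + 0.3 * np.cos(x)

X_syn = rng.uniform(-2, 2, size=(5000, 1))   # millions in the paper; 5000 here
X_real = rng.uniform(-2, 2, size=(40, 1))    # small, expensive dataset
X_test = rng.uniform(-2, 2, size=(500, 1))

# warm_start=True makes successive fit() calls continue from current weights.
pretrained = MLPRegressor(hidden_layer_sizes=(64, 64), warm_start=True,
                          max_iter=300, random_state=0)
pretrained.fit(X_syn, cheap_surrogate(X_syn[:, 0]))   # pre-train on synthetic labels
pretrained.fit(X_real, ground_truth(X_real[:, 0]))    # fine-tune on real labels

# Baseline: the same architecture trained only on the small real dataset.
scratch = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                       random_state=0).fit(X_real, ground_truth(X_real[:, 0]))

def mse(m):
    return float(np.mean((m.predict(X_test) - ground_truth(X_test[:, 0])) ** 2))
```

Comparing `mse(pretrained)` against `mse(scratch)` on held-out data shows whether the synthetic pre-training task helped, mirroring the comparison made in the paper.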
arXiv Detail & Related papers (2022-11-29T18:17:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.