Related papers: Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models

Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models

URL: http://arxiv.org/abs/2407.01608v1
Date: Thu, 27 Jun 2024 04:42:29 GMT
Title: Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models
Authors: Zhiwei Li, Carl Kesselman, Mike D'Arch, Michael Pazzani, Benjamin Yizing Xu,
Abstract summary: We show how data management tools can significantly improve the quality of data that is used for machine learning (ML) applications. We propose an architecture and implementation of such tools and demonstrate through two use cases how they can be used to improve ML-based eScience investigations.
Score: 1.204452887718077
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Increasingly, artificial intelligence (AI) and machine learning (ML) are used in eScience applications [9]. While these approaches have great potential, the literature has shown that ML-based approaches frequently suffer from results that are either incorrect or unreproducible due to mismanagement or misuse of data used for training and validating the models [12, 15]. Recognition of the necessity of high-quality data for correct ML results has led to data-centric ML approaches that shift the central focus from model development to creation of high-quality data sets to train and validate the models [14, 20]. However, there are limited tools and methods available for data-centric approaches to explore and evaluate ML solutions for eScience problems which often require collaborative multidisciplinary teams working with models and data that will rapidly evolve as an investigation unfolds [1]. In this paper, we show how data management tools based on the principle that all of the data for ML should be findable, accessible, interoperable and reusable (i.e. FAIR [26]) can significantly improve the quality of data that is used for ML applications. When combined with best practices that apply these tools to the entire life cycle of an ML-based eScience investigation, we can significantly improve the ability of an eScience team to create correct and reproducible ML solutions. We propose an architecture and implementation of such tools and demonstrate through two use cases how they can be used to improve ML-based eScience investigations.

Related papers

The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback [51.144727949988436]
Reinforcement learning (RL) has demonstrated potential to enhance the reasoning capabilities of large language models (LLMs)<n>In this work, we explore improving LLMs through RL with minimal data.<n>To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness.
arXiv Detail & Related papers (2025-10-03T06:32:10Z)
Agents of Discovery [17.016402322416994]
Large language models (LLMs) can jointly solve data analysis-based research problems.<n>We consider the task of anomaly detection via the publicly available and highly-studied LHC Olympics dataset.<n>The best agent-created solutions mirror the performance of human state-of-the-art results.
arXiv Detail & Related papers (2025-09-10T12:25:13Z)
Data Heterogeneity Modeling for Trustworthy Machine Learning [25.732841312561586]
Data heterogeneity plays a pivotal role in determining the performance of machine learning (ML) systems.<n>Traditional algorithms often overlook the intrinsic diversity within datasets.<n>We show how a deeper understanding of data diversity can enhance model robustness, fairness, and reliability.
arXiv Detail & Related papers (2025-06-01T11:36:56Z)
ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution [77.86222359025011]
We propose ToolACE-DEV, a self-improving framework for tool learning.<n>First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities.<n>We then introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs.
arXiv Detail & Related papers (2025-05-12T12:48:30Z)
Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Stronger Baseline Models -- A Key Requirement for Aligning Machine Learning Research with Clinical Utility [0.0]
Well-known barriers exist when attempting to deploy Machine Learning models in high-stakes, clinical settings. We show empirically that including stronger baseline models in evaluations has important downstream effects. We propose some best practices that will enable practitioners to more effectively study and deploy ML models in clinical settings.
arXiv Detail & Related papers (2024-09-18T16:38:37Z)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16 [0.29998889086656577]
We show that relatively minor modifications on a benchmark dataset cause significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Utilizing Domain Knowledge: Robust Machine Learning for Building Energy Prediction with Small, Inconsistent Datasets [1.1081836812143175]
The demand for a huge amount of data for machine learning (ML) applications is currently a bottleneck. We propose a method to combine prior knowledge with data-driven methods to significantly reduce their data dependency. CBML as the knowledge-encoded data-driven method is examined in the context of energy-efficient building engineering.
arXiv Detail & Related papers (2023-01-23T08:56:11Z)
Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures are often exploited to improve the solution of learning tasks. Data augmentation induces these symmetries during training by applying multiple transformations to the input data. This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
arXiv Detail & Related papers (2022-09-29T18:11:01Z)
Towards Model-informed Precision Dosing with Expert-in-the-loop Machine Learning [0.0]
We consider a ML framework that may accelerate model learning and improve its interpretability by incorporating human experts into the model learning loop. We propose a novel human-in-the-loop ML framework aimed at dealing with learning problems that the cost of data annotation is high. With an application to precision dosing, our experimental results show that the approach can learn interpretable rules from data and may potentially lower experts' workload.
arXiv Detail & Related papers (2021-06-28T03:45:09Z)
A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions. Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data. Large-scale Machine Learning aims to learn patterns from big data with comparable performance efficiently.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)
Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources [78.72922528736011]
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model. Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses. BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
arXiv Detail & Related papers (2020-07-17T01:52:34Z)
Insights into Performance Fitness and Error Metrics for Machine Learning [1.827510863075184]
Machine learning (ML) is the field of training machines to achieve high level of cognition and perform human-like analysis. This paper examines a number of the most commonly-used performance fitness and error metrics for regression and classification algorithms.
arXiv Detail & Related papers (2020-05-17T22:59:04Z)
Injective Domain Knowledge in Neural Networks for Transprecision Computing [17.300144121921882]
This paper studies the improvements that can be obtained by integrating prior knowledge when dealing with a non-trivial learning task. The results clearly show that ML models exploiting problem-specific information outperform the purely data-driven ones, with an average accuracy improvement around 38%.
arXiv Detail & Related papers (2020-02-24T12:58:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.