From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
- URL: http://arxiv.org/abs/2506.16051v1
- Date: Thu, 19 Jun 2025 06:09:01 GMT
- Title: From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
- Authors: Zhiwei Li, Carl Kesselman, Tran Huy Nguyen, Benjamin Yixing Xu, Kyle Bolo, Kimberley Yu,
- Abstract summary: Reproducibility remains a central challenge in machine learning (ML). Current ML workflows are often fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts.
- Score: 1.136688282190268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
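To make the six-artifact data model easier to picture, here is a minimal sketch of how the relationships described in the abstract could be expressed as linked records. The artifact names (Dataset, Feature, Workflow, Execution, Asset, Controlled Vocabulary) come from the abstract; every field, type, and relationship in this Python sketch is an illustrative assumption rather than the framework's actual schema.

```python
# Minimal sketch only: the six artifact names come from the abstract; every
# field, type, and relationship below is an illustrative assumption, not the
# framework's actual schema.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ControlledVocabulary:
    """Shared terms that keep annotations consistent across collaborators."""
    name: str
    terms: List[str] = field(default_factory=list)


@dataclass
class Dataset:
    """A versioned collection of input data."""
    dataset_id: str
    version: str
    description: str = ""


@dataclass
class Feature:
    """A derived feature, linked to the dataset version it was computed from."""
    feature_id: str
    source_dataset: Dataset
    transformation: str  # e.g. the script or expression that produced it


@dataclass
class Workflow:
    """The code and configuration that define an experiment."""
    workflow_id: str
    code_reference: str  # e.g. a git commit hash
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Execution:
    """One run of a workflow over specific datasets and features."""
    execution_id: str
    workflow: Workflow
    inputs: List[Dataset] = field(default_factory=list)
    features: List[Feature] = field(default_factory=list)


@dataclass
class Asset:
    """An output of an execution, e.g. a trained model, metrics file, or figure."""
    asset_id: str
    produced_by: Execution
    path: str
```

Linked this way, any Asset can be traced back through its Execution to the exact Workflow version and Datasets it came from, which is the kind of provenance chain the abstract emphasizes.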
Related papers
- MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
A Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models [1.204452887718077]
We show how data management tools can significantly improve the quality of data that is used for machine learning (ML) applications.
We propose an architecture and implementation of such tools and demonstrate through two use cases how they can be used to improve ML-based eScience investigations.
arXiv Detail & Related papers (2024-06-27T04:42:29Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Closing the loop: Autonomous experiments enabled by machine-learning-based online data analysis in synchrotron beamline environments [80.49514665620008]
Machine learning can be used to enhance research involving large or rapidly generated datasets.
In this study, we describe the incorporation of ML into a closed-loop workflow for X-ray reflectometry (XRR).
We present solutions that provide elementary data analysis in real time during the experiment without introducing additional software dependencies into the beamline control software environment.
arXiv Detail & Related papers (2023-06-20T21:21:19Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Management of Machine Learning Lifecycle Artifacts: A Survey [7.106986689736826]
We aim to give an overview of systems and platforms which support the management of machine learning lifecycle artifacts.
Based on a systematic review, we derive assessment criteria and apply them to a representative selection of more than 60 systems and platforms.
arXiv Detail & Related papers (2022-10-21T09:23:12Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE) [0.43012765978447565]
We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts.
We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment.
We perform two experiments on this metadata: 1) examining the variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data.
arXiv Detail & Related papers (2022-02-22T15:35:16Z)
- Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles [0.0]
We describe our goals and initial steps in supporting the end-to-end reproducibility of machine learning pipelines.
We investigate which factors, beyond the availability of source code and datasets, influence the reproducibility of ML experiments.
We propose ways to apply FAIR data practices to ML experiments.
arXiv Detail & Related papers (2020-06-22T10:17:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.