Related papers: Data Readiness for Scientific AI at Scale

URL: http://arxiv.org/abs/2507.23018v1
Date: Wed, 30 Jul 2025 18:30:37 GMT
Title: Data Readiness for Scientific AI at Scale
Authors: Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral
Abstract summary: This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard).
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.

Related papers

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [68.46693401421923]
WebShaper systematically formalizes information-seeking (IS) tasks through set theory. WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.
arXiv Detail & Related papers (2025-07-20T17:53:37Z)
Data-Driven Breakthroughs and Future Directions in AI Infrastructure: A Comprehensive Review [0.0]
This paper presents a comprehensive synthesis of major breakthroughs in artificial intelligence (AI) over the past fifteen years. It identifies key inflection points in AI's evolution by tracing the convergence of computational resources, data access, and algorithmic innovation.
arXiv Detail & Related papers (2025-05-22T15:12:48Z)
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
Survey and Taxonomy: The Role of Data-Centric AI in Transformer-Based Time Series Forecasting [36.31269406067809]
We argue that data-centric AI is essential for training AI models efficiently, particularly transformer-based time series forecasting (TSF) models. We review previous research from a data-centric AI perspective and aim to lay the groundwork for the future development of transformer-based architectures and data-centric AI.
arXiv Detail & Related papers (2024-07-29T08:27:21Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
A survey of synthetic data augmentation methods in computer vision [0.0]
This paper presents an extensive review of synthetic data augmentation techniques. We focus on the important data generation and augmentation techniques, general scope of application and specific use-cases. We provide a summary of common synthetic datasets for training computer vision models.
arXiv Detail & Related papers (2024-03-15T07:34:08Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances [76.34037366117234]
We introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes. We present results using state-of-the-art action recognition and domain adaptation algorithms.
arXiv Detail & Related papers (2023-03-17T23:23:55Z)
FAIR AI Models in High Energy Physics [16.744801048170732]
We propose a practical definition of FAIR principles for AI models in experimental high energy physics. We describe a template for the application of these principles. We report on the robustness of this FAIR AI model, its portability across hardware architectures and software frameworks, and its interpretability.
arXiv Detail & Related papers (2022-12-09T19:00:18Z)
FAIR principles for AI models, with a practical application for accelerated high energy diffraction microscopy [1.9270896986812693]
We showcase how to create and share FAIR data and AI models within a unified computational framework. We describe how this domain-agnostic computational framework may be harnessed to enable autonomous AI-driven discovery.
arXiv Detail & Related papers (2022-07-01T18:11:12Z)
