On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning
- URL: http://arxiv.org/abs/2401.10304v2
- Date: Tue, 17 Dec 2024 16:34:49 GMT
- Title: On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning
- Authors: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot,
- Abstract summary: We examine a sample of 4041 data papers of different domains, assessing their completeness, coverage of the requested dimensions, and trends in recent years.
We propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in machine learning technologies.
- Score: 1.8270184406083445
- License:
- Abstract: To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Besides, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, academic institutions' adoption of these practices has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this broader scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their completeness, coverage of the requested dimensions, and trends in recent years. We focus on the most and least documented dimensions and compare the results with those of an ML-focused venue (NeurIPS D&B track) publishing papers describing datasets. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in ML technologies.
Related papers
- Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents [8.516310581591426]
This study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance.
We aim to improve the accessibility of scientific documents and facilitate their wider use.
arXiv Detail & Related papers (2025-01-09T09:03:43Z) - The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation.
We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit.
Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management.
arXiv Detail & Related papers (2024-10-29T19:07:50Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Recent Advances on Machine Learning for Computational Fluid Dynamics: A Survey [51.87875066383221]
This paper introduces fundamental concepts, traditional methods, and benchmark datasets, then examine the various roles Machine Learning plays in improving CFD.
We highlight real-world applications of ML for CFD in critical scientific and engineering disciplines, including aerodynamics, combustion, atmosphere & ocean science, biology fluid, plasma, symbolic regression, and reduced order modeling.
We draw the conclusion that ML is poised to significantly transform CFD research by enhancing simulation accuracy, reducing computational time, and enabling more complex analyses of fluid dynamics.
arXiv Detail & Related papers (2024-08-22T07:33:11Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of ScientificAspects.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning [1.8270184406083445]
We explore using large language models (LLM) and prompting strategies to automatically extract dimensions from documents.
Our approach could aid data publishers and practitioners in creating machine-readable documentation.
We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
arXiv Detail & Related papers (2024-04-04T10:09:28Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - The Open Review-Based (ORB) dataset: Towards Automatic Assessment of
Scientific Papers and Experiment Proposals in High-Energy Physics [0.0]
We introduce the new comprehensive Open Review-Based dataset (ORB)
It includes a curated list of more than 36,000 scientific papers with their more than 89,000 reviews and final decisions.
This paper presents our data architecture and an overview of the collected data along with relevant statistics.
arXiv Detail & Related papers (2023-11-29T20:52:02Z) - Large Language Models as Data Preprocessors [9.99065004972981]
Large Language Models (LLMs) have marked a significant advancement in artificial intelligence.
This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications.
We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.