Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research
- URL: http://arxiv.org/abs/2501.14094v1
- Date: Thu, 23 Jan 2025 21:02:56 GMT
- Title: Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research
- Authors: Ramtin Zargari Marandi, Anne Svane Frahm, Maja Milojevic
- Abstract summary: We extend the framework to "Datasheets for AI and medical datasets - DAIMS"
Our publicly available solution, DAIMS, provides a checklist including data standardization requirements.
The checklist consists of 24 common data standardization requirements; the tool checks and validates a subset of them.
- Abstract: Despite progress in data engineering, data validation and documentation procedures remain inconsistent in some areas, causing confusion and technical problems in research involving machine learning. Frameworks such as "Datasheets for Datasets" have advanced the field, but there is still room for improvement in preparing datasets so they are ready for ML pipelines. Here, we extend that framework to "Datasheets for AI and medical datasets - DAIMS." Our publicly available solution, DAIMS, provides a checklist of data standardization requirements, a software tool to assist the data preparation process, an extended form to document data and pose research questions, a table serving as a data dictionary, and a flowchart that suggests ML analyses to address the research questions. The checklist consists of 24 common data standardization requirements, and the tool checks and validates a subset of them. In addition, we provide a flowchart mapping research questions to suggested ML methods. DAIMS can serve as a reference for standardizing datasets and a roadmap for researchers aiming to apply effective ML techniques in their medical research. DAIMS is available on GitHub and as an online app that automates key aspects of dataset evaluation, facilitating efficient preparation of datasets for ML studies.
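The abstract describes a tool that checks and validates a subset of common data standardization requirements. As an illustration only — the actual DAIMS checklist items and API are not reproduced here — a minimal sketch of such automated checks on a tabular dataset, using pandas and hypothetical check names, might look like:

```python
import pandas as pd

def validate_dataset(df: pd.DataFrame) -> dict:
    """Run a few illustrative standardization checks; names are hypothetical,
    not the actual DAIMS checklist items."""
    report = {}
    # Check: no duplicate rows
    report["no_duplicate_rows"] = not df.duplicated().any()
    # Check: column names are lowercase with no spaces
    report["clean_column_names"] = all(
        c == c.lower() and " " not in c for c in df.columns
    )
    # Check: no column is entirely missing
    report["no_empty_columns"] = not df.isna().all().any()
    return report

# A toy medical-style table with a non-standard column name
demo = pd.DataFrame({"age": [34.0, 51.0, None], "Sex M/F": ["M", "F", "F"]})
print(validate_dataset(demo))
```

A real validator would report each failing check with a remediation hint, which is closer to the checklist-plus-tool workflow the paper describes.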
Related papers
- The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation.
We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit.
Results indicate a greater need for documentation about environmental footprint, ethical considerations, and data management.
arXiv Detail & Related papers (2024-10-29T19:07:50Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z) - CSMeD: Bridging the Dataset Gap in Automated Citation Screening for
Systematic Literature Reviews [10.207938863784829]
We introduce CSMeD, a meta-dataset consolidating nine publicly released collections.
CSMeD serves as a comprehensive resource for training and evaluating the performance of automated citation screening models.
We introduce CSMeD-FT, a new dataset designed explicitly for evaluating the full text publication screening task.
arXiv Detail & Related papers (2023-11-21T09:36:11Z) - Utilising a Large Language Model to Annotate Subject Metadata: A Case
Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - PyTrial: Machine Learning Software and Benchmark for Clinical Trial
Applications [49.69824178329405]
PyTrial provides benchmarks and open-source implementations of a series of machine learning algorithms for clinical trial design and operations.
We thoroughly investigate 34 ML algorithms for clinical trials across 6 different tasks, including patient outcome prediction, trial site selection, trial outcome prediction, patient-trial matching, trial similarity search, and synthetic data generation.
PyTrial defines each task through a simple four-step process: data loading, model specification, model training, and model evaluation, all achievable with just a few lines of code.
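The four-step process PyTrial describes — data loading, model specification, model training, and model evaluation — can be sketched generically. The code below uses scikit-learn and a public dataset as stand-ins, since PyTrial's actual class and function names are not shown in this summary:

```python
# Generic four-step ML workflow sketch; scikit-learn stands in for PyTrial's API.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Step 1: data loading (a public dataset stands in for clinical trial data)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 2: model specification
model = LogisticRegression(max_iter=5000)

# Step 3: model training
model.fit(X_tr, y_tr)

# Step 4: model evaluation
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

The appeal of such a fixed four-step contract is that swapping tasks (e.g. trial outcome prediction vs. patient-trial matching) only changes which dataset and model are plugged in, not the surrounding workflow.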
arXiv Detail & Related papers (2023-06-06T21:19:03Z) - Weakly Supervised Anomaly Detection: A Survey [75.26180038443462]
Anomaly detection (AD) is a crucial task in machine learning with various applications.
We present the first comprehensive survey of weakly supervised anomaly detection (WSAD) methods.
For each setting, we provide formal definitions, key algorithms, and potential future directions.
arXiv Detail & Related papers (2023-02-09T10:27:21Z) - Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z) - DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv Detail & Related papers (2022-02-25T18:32:19Z) - Big Machinery Data Preprocessing Methodology for Data-Driven Models in
Prognostics and Health Management [0.0]
This paper presents a comprehensive, step-by-step pipeline for the preprocessing of monitoring data from complex systems.
The importance of expert knowledge is discussed in the context of data selection and label generation.
Two case studies are presented for validation, with the end goal of creating clean data sets with healthy and unhealthy labels.
arXiv Detail & Related papers (2021-10-08T17:10:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.