Healthsheet: Development of a Transparency Artifact for Health Datasets
- URL: http://arxiv.org/abs/2202.13028v1
- Date: Sat, 26 Feb 2022 01:05:55 GMT
- Title: Healthsheet: Development of a Transparency Artifact for Health Datasets
- Authors: Negar Rostamzadeh, Diana Mincu, Subhrajit Roy, Andrew Smart, Lauren
Wilcox, Mahima Pushkarna, Jessica Schrouff, Razvan Amironesei, Nyalleng
Moorosi, Katherine Heller
- Abstract summary: We introduce Healthsheet, a contextualized adaptation of the original datasheet questionnaire (Gebru et al., 2018) for health-specific applications.
We work with three publicly-available healthcare datasets as our case studies.
- Score: 13.57051456780329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) approaches have demonstrated promising results in a
wide range of healthcare applications. Data plays a crucial role in developing
ML-based healthcare systems that directly affect people's lives. Many of the
ethical issues surrounding the use of ML in healthcare stem from structural
inequalities underlying the way we collect, use, and handle data. Developing
guidelines to improve documentation practices regarding the creation, use, and
maintenance of ML healthcare datasets is therefore of critical importance. In
this work, we introduce Healthsheet, a contextualized adaptation of the
original datasheet questionnaire (Gebru et al., 2018) for
health-specific applications. Through a series of semi-structured interviews,
we adapt the datasheets for healthcare data documentation. As part of the
Healthsheet development process and to understand the obstacles researchers
face in creating datasheets, we worked with three publicly-available healthcare
datasets as our case studies, each with different types of structured data:
Electronic Health Records (EHR), clinical trial study data, and
smartphone-based performance outcome measures. Our findings from the
interviewee study and case studies show 1) that datasheets should be
contextualized for healthcare, 2) that despite incentives to adopt
accountability practices such as datasheets, there is a lack of consistency in
the broader use of these practices, 3) how the ML for health community views
datasheets, and particularly Healthsheets, as a diagnostic tool to surface
the limitations and strengths of datasets, and 4) the relative importance of
different fields in the datasheet to healthcare concerns.
Related papers
- Large Language Model Benchmarks in Medical Tasks [11.196196955468992]
This paper presents a survey of various benchmark datasets employed in medical large language models (LLMs) tasks.
The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs.
The paper emphasizes the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis.
arXiv Detail & Related papers (2024-10-28T11:07:33Z)
- FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection [83.54960238236548]
FEDMEKI not only preserves data privacy but also enhances the capability of medical foundation models.
FEDMEKI allows medical foundation models to learn from a broader spectrum of medical knowledge without direct data exposure.
arXiv Detail & Related papers (2024-08-17T15:18:56Z)
- When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? [8.89829757177796]
We examine the effectiveness of vector representations from last hidden states of Large Language Models for medical diagnostics and prognostics.
We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utilities as feature extractors.
Although the findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results.
arXiv Detail & Related papers (2024-08-15T03:56:40Z)
- A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry [2.1717945745027425]
Large Language Models (LLMs) have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation.
This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare.
Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness.
arXiv Detail & Related papers (2024-04-24T09:55:24Z)
- The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review [0.0]
Development of trustworthy AI is especially important in medicine.
We focus on the importance of data quality (training/test) in deep learning (DL).
We propose the METRIC-framework, a specialised data quality framework for medical training data.
arXiv Detail & Related papers (2024-02-21T09:15:46Z)
- Clairvoyance: A Pipeline Toolkit for Medical Time Series [95.22483029602921]
Time-series learning is the bread and butter of data-driven clinical decision support.
Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a software toolkit.
Clairvoyance is the first to demonstrate viability of a comprehensive and automatable pipeline for clinical time-series ML.
arXiv Detail & Related papers (2023-10-28T12:08:03Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine [68.7814360102644]
We propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning.
We demonstrate the effectiveness of our method in generating patient discharge instructions.
arXiv Detail & Related papers (2022-10-23T16:34:39Z)
- How to Leverage Multimodal EHR Data for Better Medical Predictions? [13.401754962583771]
The complexity of electronic health records (EHR) data is a challenge for the application of deep learning.
In this paper, we first extract the accompanying clinical notes from EHR and propose a method to integrate these data.
The results on two medical prediction tasks show that our fused model with different data outperforms the state-of-the-art method.
arXiv Detail & Related papers (2021-10-29T13:26:05Z)
- MIMO: Mutual Integration of Patient Journey and Medical Ontology for Healthcare Representation Learning [49.57261599776167]
We propose an end-to-end robust Transformer-based solution, Mutual Integration of patient journey and Medical Ontology (MIMO) for healthcare representation learning and predictive analytics.
arXiv Detail & Related papers (2021-07-20T07:04:52Z)