Related papers: Copycats: the many lives of a publicly available medical imaging dataset

Copycats: the many lives of a publicly available medical imaging dataset

URL: http://arxiv.org/abs/2402.06353v3
Date: Wed, 30 Oct 2024 15:40:11 GMT
Title: Copycats: the many lives of a publicly available medical imaging dataset
Authors: Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Anna Rogers, Hubert Dariusz Zając, Veronika Cheplygina,
Abstract summary: Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets.
Score: 12.98380178359767
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets' context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

Related papers

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.<n>ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
Advancing Medical Representation Learning Through High-Quality Data [14.522284057070395]
We introduce Open-PMC, a high-quality medical dataset from PubMed Central. In-text references provide richer medical context, extending beyond the abstract information typically found in captions. We benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks.
arXiv Detail & Related papers (2025-03-18T16:10:11Z)
In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review [18.178774133733686]
We propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. We discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle.
arXiv Detail & Related papers (2025-01-18T11:03:59Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs) We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [19.128392861461297]
We conduct meticulous analyses of two popular dermatological image datasets: DermaMNIST and Fitzpatrick17k. We uncover data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets.
arXiv Detail & Related papers (2024-01-25T20:29:01Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets [17.774341783844026]
This work proposes Multimodal Integration of Oncology Data System (MINDS) MINDS is a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources. By harmonizing multimodal data, MINDS aims to potentially empower researchers with greater analytical ability.
arXiv Detail & Related papers (2023-09-30T15:44:39Z)
Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning. We first review methods for generating privacy-preserving graph data. Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z)
Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey [6.277848092408045]
Data quality is the key factor for the development of trustworthy AI in healthcare. Access to good quality datasets is limited by the technical difficulty of data acquisition. Large-scale sharing of healthcare data is hindered by strict ethical restrictions.
arXiv Detail & Related papers (2022-09-17T13:34:17Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
The Problem of Zombie Datasets:A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender. We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
A Real Use Case of Semi-Supervised Learning for Mammogram Classification in a Local Clinic of Costa Rica [0.5541644538483946]
Training a deep learning model requires a considerable amount of labeled images. A number of publicly available datasets have been built with data from different hospitals and clinics. The use of the semi-supervised deep learning approach known as MixMatch, to leverage the usage of unlabeled data is proposed and evaluated.
arXiv Detail & Related papers (2021-07-24T22:26:50Z)
On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets [0.0]
Data scarcity, mismatch between training and target population, group imbalance, and lack of documentation are important sources of bias. This paper presents an overview of the currently public available COVID-19 chest X-ray datasets.
arXiv Detail & Related papers (2020-08-26T14:16:01Z)
Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises the challenges. We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories. Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.