In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review
- URL: http://arxiv.org/abs/2501.10727v1
- Date: Sat, 18 Jan 2025 11:03:59 GMT
- Title: In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review
- Authors: Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Sarah de Boer, Víctor M. Campello, Aasa Feragen, Enzo Ferrante, Melanie Ganz, Judy Wawira Gichoya, Camila González, Steff Groefsema, Alessa Hering, Adam Hulman, Leo Joskowicz, Dovile Juodelyte, Melih Kandemir, Thijs Kooi, Jorge del Pozo Lérida, Livie Yumeng Li, Andre Pacheco, Tim Rädsch, Mauricio Reyes, Théo Sourget, Bram van Ginneken, David Wen, Nina Weng, Jack Junchi Xu, Hubert Dariusz Zając, Maria A. Zuluaga, Veronika Cheplygina,
- Abstract summary: We propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications.
We discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle.
- Score: 18.178774133733686
- License:
- Abstract: Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static -- they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://130.226.140.142.
Related papers
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z) - Copycats: the many lives of a publicly available medical imaging dataset [12.98380178359767]
Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare.
MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace.
While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets.
arXiv Detail & Related papers (2024-02-09T12:01:22Z) - SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z) - Applied Deep Learning to Identify and Localize Polyps from Endoscopic
Images [0.0]
We have aimed at open sourcing a dataset which contains annotations of polyps and ulcers.
This is the first dataset that's coming from India containing polyp and ulcer images.
We evaluated our dataset with several popular deep learning object detection models that's trained on large publicly available datasets.
arXiv Detail & Related papers (2023-01-22T22:14:25Z) - Is More Data All You Need? A Causal Exploration [4.756600446882457]
Causal analysis is often used in medicine and economics to gain insights about the effects of actions and policies.
In this paper we explore the effect of dataset interventions on the output of image classification models.
arXiv Detail & Related papers (2022-06-06T08:02:54Z) - Self-Supervised Learning as a Means To Reduce the Need for Labeled Data
in Medical Image Analysis [64.4093648042484]
We use a dataset of chest X-ray images with bounding box labels for 13 different classes of anomalies.
We show that it is possible to achieve similar performance to a fully supervised model in terms of mean average precision and accuracy with only 60% of the labeled data.
arXiv Detail & Related papers (2022-06-01T09:20:30Z) - LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network
using Transformers for Cross-Modal Information Retrieval in Histopathology
Archives [0.7614628596146599]
Cross-modality data retrieval has become a requirement for many domains and disciplines of research.
This study proposes a novel architecture with a new loss term to help represent images and texts in the joint latent space.
arXiv Detail & Related papers (2022-03-02T22:42:20Z) - Application of DatasetGAN in medical imaging: preliminary studies [10.260087683496431]
Generative adversarial networks (GANs) have been widely investigated for many potential applications in medical imaging.
datasetGAN is a recently proposed framework based on modern GANs that can synthesize high-quality segmented images.
There are no published studies focusing on its applications to medical imaging.
arXiv Detail & Related papers (2022-02-27T22:03:20Z) - Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.