The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular
Dataset of 3.5M Screening and Diagnostic Mammograms
- URL: http://arxiv.org/abs/2202.04073v1
- Date: Tue, 8 Feb 2022 14:40:59 GMT
- Title: The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular
Dataset of 3.5M Screening and Diagnostic Mammograms
- Authors: Jiwoong J. Jeong, Brianna L. Vey, Ananth Reddy, Thomas Kim, Thiago
Santos, Ramon Correa, Raman Dutt, Marina Mosunjac, Gabriela Oprea-Ilies,
Geoffrey Smith, Minjae Woo, Christopher R. McAdams, Mary S. Newell, Imon
Banerjee, Judy Gichoya, Hari Trivedi
- Abstract summary: The EMory BrEast imaging Dataset (EMBED) contains 3,650,000 2D and DBT screening and diagnostic mammograms for 116,000 women, divided equally between White and African American patients.
Our goal is to share this dataset with research partners to aid in development and validation of breast AI models that will serve all patients fairly and help decrease bias in medical AI.
- Score: 2.243792799100692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing and validating artificial intelligence models in medical imaging
requires datasets that are large, granular, and diverse. To date, the majority
of publicly available breast imaging datasets are lacking in one or more of these
areas. Models trained on these data may therefore underperform on patient
populations or pathologies that have not previously been encountered. The EMory
BrEast imaging Dataset (EMBED) addresses these gaps by providing 3,650,000 2D
and DBT screening and diagnostic mammograms for 116,000 women divided equally
between White and African American patients. The dataset also contains 40,000
annotated lesions linked to structured imaging descriptors and 61 ground truth
pathologic outcomes grouped into six severity classes. Our goal is to share
this dataset with research partners to aid in development and validation of
breast AI models that will serve all patients fairly and help decrease bias in
medical AI.
Related papers
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated the performance of Gemini, GPT-4, and four popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z)
- Demographic Bias of Expert-Level Vision-Language Foundation Models in Medical Imaging [13.141767097232796]
Self-supervised vision-language foundation models can detect a broad spectrum of pathologies without relying on explicit training annotations.
It is crucial to ensure that these AI models do not mirror or amplify human biases, thereby disadvantaging historically marginalized groups such as females or Black patients.
This study investigates the algorithmic fairness of state-of-the-art vision-language foundation models in chest X-ray diagnosis across five globally-sourced datasets.
arXiv Detail & Related papers (2024-02-22T18:59:53Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Generative models improve fairness of medical classifiers under distribution shifts [49.10233060774818]
We show that learning realistic augmentations automatically from data is possible in a label-efficient manner using generative models.
We demonstrate that these learned augmentations can surpass heuristic ones by making models more robust and statistically fair in- and out-of-distribution.
arXiv Detail & Related papers (2023-04-18T18:15:38Z)
- Diffusion Probabilistic Models beat GANs on Medical Images [0.13386555802329278]
We propose Medfusion, a conditional latent DDPM for medical images.
We compare our DDPM-based model against GAN-based models, which constitute the current state-of-the-art in the medical domain.
Our study shows that DDPMs are a superior alternative to GANs for image synthesis in the medical domain.
arXiv Detail & Related papers (2022-12-14T20:46:50Z)
- Federated Learning Enables Big Data for Rare Cancer Boundary Detection [98.5549882883963]
We present findings from the largest Federated ML study to date, involving data from 71 healthcare institutions across 6 continents.
We generate an automatic tumor boundary detector for the rare disease of glioblastoma.
We demonstrate a 33% improvement over a publicly trained model in delineating the surgically targetable tumor, and a 23% improvement over the tumor's entire extent.
arXiv Detail & Related papers (2022-04-22T17:27:00Z)
- Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence [79.038671794961]
We launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be trained in a distributed manner and executed independently at each host institution.
Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK.
arXiv Detail & Related papers (2021-11-18T00:43:41Z)
- Detection of masses and architectural distortions in digital breast tomosynthesis: a publicly available dataset of 5,060 patients and a deep learning model [4.3359550072619255]
We have curated and made publicly available a large-scale dataset of digital breast tomosynthesis images.
It contains 22,032 reconstructed volumes belonging to 5,610 studies from 5,060 patients.
We developed a single-phase deep learning detection model and tested it using our dataset to serve as a baseline for future research.
arXiv Detail & Related papers (2020-11-13T18:33:31Z)
- OPTIMAM Mammography Image Database: a large scale resource of mammography images and clinical data [0.2600410195810869]
A major barrier to medical imaging research is a lack of large databases of medical images which share images with other researchers.
The OPTIMAM image database (OMI-DB) has been developed to overcome these barriers.
The database contains over 2.5 million images from 173,319 women collected from three UK breast screening centres.
arXiv Detail & Related papers (2020-04-09T17:12:13Z)
- Heterogeneity Loss to Handle Intersubject and Intrasubject Variability in Cancer [11.440201348567681]
Deep learning (DL) models have shown impressive results in the medical domain.
These AI methods can provide immense support to developing nations as affordable healthcare solutions.
This work is focused on one such application of blood cancer diagnosis.
arXiv Detail & Related papers (2020-03-06T16:16:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.