Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds
- URL: http://arxiv.org/abs/2602.07149v1
- Date: Fri, 06 Feb 2026 19:47:10 GMT
- Title: Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds
- Authors: Rawisara Lohanimit, Yankun Wu, Amelia Katirai, Yuta Nakashima, Noa Garcia
- Abstract summary: This work explores the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. We retrieve pregnancy ultrasound images and detect thousands of entities of private information, such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation.
- Score: 18.713340629300102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve pregnancy ultrasound images and detect thousands of entities of private information, such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
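The retrieval step the abstract describes (CLIP embedding similarity over LAION-400M) reduces to a nearest-neighbour search in embedding space: embed a text query such as "pregnancy ultrasound", then rank dataset images by cosine similarity to it. A minimal sketch of that step, using random vectors as stand-ins for real CLIP embeddings (the threshold, dimensions, and gallery size here are illustrative, not values from the paper):

```python
import numpy as np

def cosine_similarity(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and a gallery of embeddings."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

def retrieve(query_emb: np.ndarray, image_embs: np.ndarray, threshold: float = 0.3):
    """Return indices of images whose similarity to the query exceeds threshold."""
    sims = cosine_similarity(query_emb, image_embs)
    return np.where(sims >= threshold)[0], sims

# Toy stand-in embeddings. In the real pipeline these would be a CLIP text
# embedding of the query prompt and precomputed CLIP image embeddings of the
# dataset; here one gallery vector is deliberately placed near the query.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))
query = gallery[42] + 0.1 * rng.normal(size=512)
idx, sims = retrieve(query, gallery, threshold=0.5)
print(idx)  # indices of matching images
```

For unrelated high-dimensional random vectors, cosine similarity concentrates near zero, so only the deliberately planted neighbour clears a threshold of 0.5; the same concentration effect is what makes a threshold on CLIP similarity a workable retrieval filter.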
Related papers
- Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models [62.979954692036685]
We introduce PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring and semantic prompt search. Our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
arXiv Detail & Related papers (2025-04-25T02:51:23Z)
- De-Identification of Medical Imaging Data: A Comprehensive Tool for Ensuring Patient Privacy [4.376648893167674]
The open-source tool can be used to de-identify DICOM magnetic resonance images, computed tomography images, whole-slide images, and magnetic resonance TWIX raw data.
The proposal comprises an elaborate anonymization pipeline for multiple input types, reducing the need for additional de-identification tools for imaging data.
arXiv Detail & Related papers (2024-10-16T09:31:24Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- Removing confounding information from fetal ultrasound images [1.6624933615451838]
Confounding information in the form of text or markings embedded in medical images can severely affect the training of diagnostic deep learning algorithms.
In dermatology, known examples include drawings or rulers that are overrepresented in images of malignant lesions.
In this paper, we encounter text and calipers placed on the images found in national databases containing fetal screening ultrasound scans.
arXiv Detail & Related papers (2023-03-24T11:13:33Z)
- Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging [47.99192239793597]
We evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
arXiv Detail & Related papers (2023-02-03T09:49:13Z)
- ConfounderGAN: Protecting Image Data Privacy with Causal Confounder [85.6757153033139]
We propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners.
Experiments are conducted on six image classification datasets, consisting of three natural object datasets and three medical datasets.
arXiv Detail & Related papers (2022-12-04T08:49:14Z)
- OdontoAI: A human-in-the-loop labeled data set and an online platform to boost research on dental panoramic radiographs [53.67409169790872]
This study addresses the construction of a public data set of dental panoramic radiographs.
We benefit from the human-in-the-loop (HITL) concept to expedite the labeling procedure.
Results demonstrate a 51% labeling time reduction using HITL, saving us more than 390 continuous working hours.
arXiv Detail & Related papers (2022-03-29T18:57:23Z)
- Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks [48.732863591145964]
We propose a multi-modal convolutional neural network architecture that labels endoscopic ultrasound (EUS) images from raw verbal comments provided by a clinician during the procedure.
Our results show a prediction accuracy of 76% at the image level on a dataset with 5 different labels.
arXiv Detail & Related papers (2021-10-12T21:22:24Z)
- Personalized Image Semantic Segmentation [58.980245748434]
We generate more accurate segmentation results on unlabeled personalized images by investigating the data's personalized traits.
We propose a baseline method that incorporates the inter-image context when segmenting certain images.
The code and the PIS dataset will be made publicly available.
arXiv Detail & Related papers (2021-07-24T04:03:11Z)
- A Deep Learning Approach to Private Data Sharing of Medical Images Using Conditional GANs [1.2099130772175573]
We present a method for generating a synthetic dataset based on a COSENTYX (secukinumab) ankylosing spondylitis clinical study, and conduct an in-depth analysis of the dataset's properties along three key metrics: image fidelity, sample diversity, and dataset privacy.
arXiv Detail & Related papers (2021-06-24T17:24:06Z)
- Privacy-Preserving Image Classification in the Local Setting [17.375582978294105]
Local Differential Privacy (LDP) brings us a promising solution, which allows data owners to randomly perturb their input, providing plausible deniability of the data before release.
In this paper, we consider a two-party image classification problem, in which data owners hold the image and the untrustworthy data user would like to fit a machine learning model with these images as input.
We propose a supervised image feature extractor, DCAConv, which produces an image representation with scalable domain size.
arXiv Detail & Related papers (2020-02-09T01:25:52Z)
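The LDP perturbation described in the entry above can be illustrated with randomized response, the canonical epsilon-LDP mechanism for a single binary attribute (a generic sketch, not the DCAConv feature extractor the paper proposes): each owner reports their true bit with probability e^eps / (1 + e^eps) and the flipped bit otherwise, and the aggregator debiases the noisy counts.

```python
import math
import random

def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """epsilon-LDP randomized response: report the true bit with
    probability e^eps / (1 + e^eps), otherwise report the flipped bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def debias(reports: list[int], epsilon: float) -> float:
    """Unbiased estimate of the true proportion of 1s from noisy reports.
    If observed = p*true + (1-p)*(1-true), then true = (observed + p - 1)/(2p - 1)."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed + p - 1.0) / (2.0 * p - 1.0)

rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700          # true proportion of 1s = 0.3
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
est = debias(reports, epsilon=1.0)
print(round(est, 2))  # close to 0.3, despite every individual report being deniable
```

Each individual report is plausibly deniable (either bit value could have produced it), yet the population-level statistic remains recoverable; this is the trade-off the entry refers to, which DCAConv extends from single bits to scalable image-feature domains.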
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.