Can I trust my fake data -- A comprehensive quality assessment framework
for synthetic tabular data in healthcare
- URL: http://arxiv.org/abs/2401.13716v1
- Date: Wed, 24 Jan 2024 08:14:20 GMT
- Title: Can I trust my fake data -- A comprehensive quality assessment framework
for synthetic tabular data in healthcare
- Authors: Vibeke Binz Vallevik, Aleksandar Babic, Serena Elizabeth Marshall,
Severin Elvatun, Helga Brøgger, Sharmini Alagaratnam, Bjørn Edwin,
Narasimha Raghavan Veeraragavan, Anne Kjersti Befring, Jan Franz Nygård
- Abstract summary: In response to privacy concerns and regulatory requirements, using synthetic data has been suggested.
We present a conceptual framework for quality assurance of SD for AI applications in healthcare.
We propose stages necessary to support real-life applications.
- Score: 33.855237079128955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring safe adoption of AI tools in healthcare hinges on access to
sufficient data for training, testing and validation. In response to privacy
concerns and regulatory requirements, using synthetic data has been suggested.
Synthetic data is created by training a generator on real data to produce a
dataset with similar statistical properties. Competing metrics with differing
taxonomies for quality evaluation have been suggested, resulting in a complex
landscape. Optimising quality entails balancing considerations that make the
data fit for use, yet relevant dimensions are left out of existing frameworks.
We performed a comprehensive literature review on the use of quality evaluation
metrics on SD within the scope of tabular healthcare data and SD made using
deep generative methods. Based on this and the collective team experiences, we
developed a conceptual framework for quality assurance. The applicability was
benchmarked against a practical case from the Dutch National Cancer Registry.
We present a conceptual framework for quality assurance of SD for AI
applications in healthcare that aligns diverging taxonomies, expands on common
quality dimensions to include the dimensions of Fairness and Carbon footprint,
and proposes stages necessary to support real-life applications. Building trust
in synthetic data by increasing transparency and reducing the safety risk will
accelerate the development and uptake of trustworthy AI tools for the benefit
of patients. Despite the growing emphasis on algorithmic fairness and carbon
footprint, these metrics were scarce in the literature review. The overwhelming
focus was on statistical similarity using distance metrics while sequential
logic detection was scarce. A consensus-backed framework that includes all
relevant quality dimensions can provide assurance for safe and responsible
real-life applications of SD.
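As a rough illustration of what "statistical similarity using distance metrics" can mean in practice, the sketch below computes the total variation distance between the empirical distributions of one categorical column in a real and a synthetic table. This is not a metric taken from the paper's framework; the function name `total_variation_distance` and the toy columns are hypothetical and chosen only for this example.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """Total variation distance between two categorical columns.

    Returns 0.0 when the empirical distributions are identical and 1.0
    when the two columns share no category at all.
    """
    if not real or not synthetic:
        raise ValueError("both columns must be non-empty")
    p = Counter(real)        # category counts in the real column
    q = Counter(synthetic)   # category counts in the synthetic column
    categories = set(p) | set(q)
    # Counter returns 0 for categories missing from one of the columns.
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )


# Hypothetical toy columns (e.g. a tumour stage field), for illustration only.
real_stage = ["I", "I", "II", "III"]
synthetic_stage = ["I", "II", "II", "III"]
tvd = total_variation_distance(real_stage, synthetic_stage)
print(tvd)
```

A per-column score like this captures only marginal similarity. It says nothing about correlations between columns, sequential logic, privacy, fairness or carbon footprint, which is the kind of gap the abstract points to.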
Related papers
- Scorecards for Synthetic Medical Data Evaluation and Reporting [2.8262986891348056]
The growing utilization of synthetic medical data (SMD) in training and testing AI-driven tools in healthcare requires a systematic framework for assessing its quality.
Here, we outline an evaluation framework designed to meet the unique requirements of medical applications.
We introduce the concept of scorecards, which can serve as comprehensive reports that accompany artificially generated datasets.
arXiv Detail & Related papers (2024-06-17T02:11:59Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [52.5766244206855]
This paper challenges cutting-edge generative models to automatically synthesize data for assessing reliability in semantic segmentation.
By fine-tuning Stable Diffusion, we perform zero-shot generation of synthetic data in OOD domains or inpainted with OOD objects.
We demonstrate a high correlation between the performance on synthetic data and the performance on real OOD data, showing the validity of the approach.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
- Privacy-Preserving Medical Image Classification through Deep Learning and Matrix Decomposition [0.0]
Deep learning (DL) solutions have been extensively researched in the medical domain in recent years.
The usage of health-related data is strictly regulated, so processing medical records outside the hospital environment demands robust data protection measures.
In this paper, we use singular value decomposition (SVD) and principal component analysis (PCA) to obfuscate the medical images before employing them in the DL analysis.
The capability of DL algorithms to extract relevant information from secured data is assessed on a task of angiographic view classification based on obfuscated frames.
arXiv Detail & Related papers (2023-08-31T08:21:09Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey [6.277848092408045]
Data quality is the key factor for the development of trustworthy AI in healthcare.
Access to good quality datasets is limited by the technical difficulty of data acquisition.
Large-scale sharing of healthcare data is hindered by strict ethical restrictions.
arXiv Detail & Related papers (2022-09-17T13:34:17Z)
- Enabling Synthetic Data adoption in regulated domains [1.9512796489908306]
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties.
arXiv Detail & Related papers (2022-04-13T10:53:54Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.