Can I trust my fake data -- A comprehensive quality assessment framework
for synthetic tabular data in healthcare
- URL: http://arxiv.org/abs/2401.13716v1
- Date: Wed, 24 Jan 2024 08:14:20 GMT
- Title: Can I trust my fake data -- A comprehensive quality assessment framework
for synthetic tabular data in healthcare
- Authors: Vibeke Binz Vallevik, Aleksandar Babic, Serena Elizabeth Marshall,
Severin Elvatun, Helga Brøgger, Sharmini Alagaratnam, Bjørn Edwin,
Narasimha Raghavan Veeraragavan, Anne Kjersti Befring, Jan Franz Nygård
- Abstract summary: In response to privacy concerns and regulatory requirements, the use of synthetic data has been suggested.
We present a conceptual framework for quality assurance of synthetic data (SD) for AI applications in healthcare.
We propose the stages necessary to support real-life applications.
- Score: 33.855237079128955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring safe adoption of AI tools in healthcare hinges on access to
sufficient data for training, testing and validation. In response to privacy
concerns and regulatory requirements, using synthetic data has been suggested.
Synthetic data is created by training a generator on real data to produce a
dataset with similar statistical properties. Competing metrics with differing
taxonomies for quality evaluation have been suggested, resulting in a complex
landscape. Optimising quality entails balancing considerations that make the
data fit for use, yet relevant dimensions are left out of existing frameworks.
We performed a comprehensive literature review on the use of quality evaluation
metrics on SD within the scope of tabular healthcare data and SD made using
deep generative methods. Based on this and the collective team experiences, we
developed a conceptual framework for quality assurance. The applicability was
benchmarked against a practical case from the Dutch National Cancer Registry.
We present a conceptual framework for quality assurance of SD for AI
applications in healthcare that aligns diverging taxonomies, expands on common
quality dimensions to include the dimensions of Fairness and Carbon footprint,
and proposes stages necessary to support real-life applications. Building trust
in synthetic data by increasing transparency and reducing the safety risk will
accelerate the development and uptake of trustworthy AI tools for the benefit
of patients. Despite the growing emphasis on algorithmic fairness and carbon
footprint, these metrics were scarce in the literature review. The overwhelming
focus was on statistical similarity measured with distance metrics, while
detection of sequential logic was rarely addressed. A consensus-backed framework that includes all
relevant quality dimensions can provide assurance for safe and responsible
real-life applications of SD.
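The statistical-similarity evaluations that dominate the reviewed literature typically compare the marginal distributions of a real column and its synthetic counterpart with a distance metric. As a minimal illustration (the data here are random stand-ins, not from the paper), the two-sample Kolmogorov-Smirnov distance can be computed directly with NumPy:

```python
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
real = rng.normal(50, 10, 1000)       # stand-in for a real "age" column
synthetic = rng.normal(52, 11, 1000)  # stand-in for generator output

d = ks_statistic(real, synthetic)
print(f"KS distance: {d:.3f}")  # 0 = identical marginals, 1 = disjoint
```

A low KS distance only certifies marginal similarity; it says nothing about the fairness, privacy, or sequential-logic dimensions the framework argues are missing from such evaluations.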
Related papers
- Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare [9.381558154295012]
Segment Anything Model (SAM) excels in intelligent image segmentation.
SAM poses significant challenges for deployment on resource-limited edge devices.
We propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data.
arXiv Detail & Related papers (2024-09-14T10:43:35Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Scorecards for Synthetic Medical Data Evaluation and Reporting [2.8262986891348056]
The growing utilization of synthetic medical data (SMD) in training and testing AI-driven tools in healthcare requires a systematic framework for assessing its quality.
Here, we outline an evaluation framework designed to meet the unique requirements of medical applications.
We introduce the concept of scorecards, which can serve as comprehensive reports that accompany artificially generated datasets.
arXiv Detail & Related papers (2024-06-17T02:11:59Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
- Privacy-Preserving Medical Image Classification through Deep Learning and Matrix Decomposition [0.0]
Deep learning (DL) solutions have been extensively researched in the medical domain in recent years.
Because the usage of health-related data is strictly regulated, processing medical records outside the hospital environment demands robust data protection measures.
In this paper, we use singular value decomposition (SVD) and principal component analysis (PCA) to obfuscate the medical images before employing them in the DL analysis.
The capability of DL algorithms to extract relevant information from secured data is assessed on a task of angiographic view classification based on obfuscated frames.
arXiv Detail & Related papers (2023-08-31T08:21:09Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey [6.277848092408045]
Data quality is the key factor for the development of trustworthy AI in healthcare.
Access to good quality datasets is limited by the technical difficulty of data acquisition.
Large-scale sharing of healthcare data is hindered by strict ethical restrictions.
arXiv Detail & Related papers (2022-09-17T13:34:17Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.