Related papers: Can I trust my fake data -- A comprehensive quality assessment framework for synthetic tabular data in healthcare

URL: http://arxiv.org/abs/2401.13716v1
Date: Wed, 24 Jan 2024 08:14:20 GMT
Title: Can I trust my fake data -- A comprehensive quality assessment framework for synthetic tabular data in healthcare
Authors: Vibeke Binz Vallevik, Aleksandar Babic, Serena Elizabeth Marshall, Severin Elvatun, Helga Brøgger, Sharmini Alagaratnam, Bjørn Edwin, Narasimha Raghavan Veeraragavan, Anne Kjersti Befring, Jan Franz Nygård
Abstract summary: In response to privacy concerns and regulatory requirements, using synthetic data (SD) has been suggested. We present a conceptual framework for quality assurance of SD for AI applications in healthcare. We propose stages necessary to support real-life applications.
Score: 33.855237079128955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. In response to privacy concerns and regulatory requirements, using synthetic data (SD) has been suggested. Synthetic data is created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been suggested, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. We performed a comprehensive literature review on the use of quality evaluation metrics on SD within the scope of tabular healthcare data and SD made using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. We present a conceptual framework for quality assurance of SD for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of SD.

Related papers

DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction [0.0]
This paper presents a hybrid de-identification framework that combines rule-based and AI-driven techniques. Our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research.
arXiv Detail & Related papers (2025-07-31T17:19:38Z)
Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework [0.4874819476581695]
Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. We present a framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy.
arXiv Detail & Related papers (2025-04-02T17:10:30Z)
Requirements for Quality Assurance of AI Models for Early Detection of Lung Cancer [0.5801420352256208]
Lung cancer is the second most common cancer and the leading cause of cancer-related deaths worldwide. Under the EU AI Act, consistent quality assurance is required for AI-based nodule detection, measurement, and characterization. This position paper proposes systematic quality assurance grounded in a validated reference dataset.
arXiv Detail & Related papers (2025-02-24T20:38:29Z)
Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare [9.381558154295012]
Segment Anything Model (SAM) excels in intelligent image segmentation. SAM poses significant challenges for deployment on resource-limited edge devices. We propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data.
arXiv Detail & Related papers (2024-09-14T10:43:35Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
Scorecards for Synthetic Medical Data Evaluation and Reporting [2.8262986891348056]
The growing utilization of synthetic medical data (SMD) in training and testing AI-driven tools in healthcare requires a systematic framework for assessing its quality. Here, we outline an evaluation framework designed to meet the unique requirements of medical applications. We introduce the concept of scorecards, which can serve as comprehensive reports that accompany artificially generated datasets.
arXiv Detail & Related papers (2024-06-17T02:11:59Z)
Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. This synthetic data is employed to evaluate the robustness of pretrained segmenters. We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
Privacy-Preserving Medical Image Classification through Deep Learning and Matrix Decomposition [0.0]
Deep learning (DL) solutions have been extensively researched in the medical domain in recent years. The usage of health-related data is strictly regulated, and processing medical records outside the hospital environment demands robust data protection measures. In this paper, we use singular value decomposition (SVD) and principal component analysis (PCA) to obfuscate the medical images before employing them in the DL analysis. The capability of DL algorithms to extract relevant information from secured data is assessed on a task of angiographic view classification based on obfuscated frames.
arXiv Detail & Related papers (2023-08-31T08:21:09Z)
QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality. We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensures fidelity to the source data, and assesses utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey [6.277848092408045]
Data quality is the key factor for the development of trustworthy AI in healthcare. Access to good quality datasets is limited by the technical difficulty of data acquisition. Large-scale sharing of healthcare data is hindered by strict ethical restrictions.
arXiv Detail & Related papers (2022-09-17T13:34:17Z)
Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging. We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets. We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.