Fidelity and Privacy of Synthetic Medical Data
- URL: http://arxiv.org/abs/2101.08658v1
- Date: Mon, 18 Jan 2021 23:01:27 GMT
- Title: Fidelity and Privacy of Synthetic Medical Data
- Authors: Ofer Mendelevitch, Michael D. Lesh
- Abstract summary: The digitization of medical records ushered in a new era of big data to clinical science.
The need to share individual-level medical data continues to grow, and has never been more urgent.
Enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The digitization of medical records ushered in a new era of big data to
clinical science, and with it the possibility that data could be shared, to
multiply insights beyond what investigators could abstract from paper records.
The need to share individual-level medical data to accelerate innovation in
precision medicine continues to grow, and has never been more urgent, as
scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use
of big data has been tempered by a fully appropriate concern for patient
autonomy and privacy. That is, the ability to extract private or confidential
information about an individual, in practice, renders it difficult to share
data, since significant infrastructure and data governance must be established
before data can be shared. Although HIPAA provided de-identification as an
approved mechanism for data sharing, linkage attacks were identified as a major
vulnerability. A variety of mechanisms have been established to avoid leaking
private information, such as field suppression or abstraction, strictly
limiting the amount of information that can be shared, or employing
mathematical techniques such as differential privacy. Another approach, which
we focus on here, is creating synthetic data that mimics the underlying data.
For synthetic data to be a useful mechanism in support of medical innovation
and a proxy for real-world evidence, one must demonstrate two properties of the
synthetic dataset: (1) any analysis on the real data must be matched by
analysis of the synthetic data (statistical fidelity) and (2) the synthetic
data must preserve privacy, with minimal risk of re-identification (privacy
guarantee). In this paper we propose a framework for quantifying the
statistical fidelity and privacy preservation properties of synthetic datasets
and demonstrate these metrics for synthetic data generated by Syntegra
technology.
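As a rough illustration of the two properties the abstract requires, the sketch below computes a per-column total variation distance between real and synthetic marginals (a minimal fidelity check) and an exact-match rate of synthetic rows against real rows (a crude proxy for re-identification risk). The function names and toy data are illustrative assumptions, not the metrics used by Syntegra.

```python
from collections import Counter

def marginal_tv_distance(real_col, synth_col):
    """Total variation distance between the marginal distributions of one
    categorical column in the real vs. synthetic data.
    0.0 means identical marginals (high fidelity); 1.0 means disjoint support."""
    p, q = Counter(real_col), Counter(synth_col)
    n_p, n_q = len(real_col), len(synth_col)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / n_p - q[v] / n_q) for v in support)

def exact_match_rate(real_rows, synth_rows):
    """Fraction of synthetic rows that exactly duplicate a real row.
    A crude proxy for re-identification risk: a privacy-preserving generator
    should keep this near the rate expected by chance alone."""
    real_set = set(map(tuple, real_rows))
    matches = sum(1 for row in synth_rows if tuple(row) in real_set)
    return matches / len(synth_rows)

# Toy example with a single categorical field (e.g., a diagnosis code).
real = ["A", "A", "B", "C"]
synth = ["A", "B", "B", "C"]
print(marginal_tv_distance(real, synth))  # -> 0.25
```

In practice a fidelity framework would aggregate such distances across all columns (and pairwise joints), and a privacy assessment would go beyond exact matches to nearest-neighbor distances; this sketch only shows the shape of the two measurements.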
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics [0.412484724941528]
Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse.
Synthetic data emerges as a potential remedy, offering robust privacy protection.
Prior LA research on synthetic data lacks the thorough evaluation essential for assessing the delicate balance between privacy and data utility.
arXiv Detail & Related papers (2024-01-12T20:27:55Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development [0.6906005491572401]
This paper investigates the policies governing the creation, utilization, and dissemination of synthetic data.
A well-crafted synthetic data policy must strike a balance between privacy concerns and the utility of data.
arXiv Detail & Related papers (2023-08-31T23:18:53Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Synthetic Data: Methods, Use Cases, and Risks [11.413309528464632]
A possible alternative gaining momentum in both the research community and industry is to share synthetic data instead.
We provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that remain unaddressed, and its inherent limitations as an effective privacy-enhancing technology.
arXiv Detail & Related papers (2023-03-01T16:35:33Z)
- Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, no single approach for generating synthetic genomic data performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z)
- Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
- Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data [2.538209532048867]
Recent advances in generating synthetic data with principled privacy protections are a crucial step toward sharing statistical information in a privacy-preserving way.
To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter.
We develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective.
arXiv Detail & Related papers (2020-04-16T16:24:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.