Related papers: Enabling Synthetic Data adoption in regulated domains

Enabling Synthetic Data adoption in regulated domains

URL: http://arxiv.org/abs/2204.06297v1
Date: Wed, 13 Apr 2022 10:53:54 GMT
Title: Enabling Synthetic Data adoption in regulated domains
Authors: Giorgio Visani, Giacomo Graffi, Mattia Alfero, Enrico Bagli, Davide Capuzzo, Federico Chesani
Abstract summary: The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms. In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for. A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties.
Score: 1.9512796489908306
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms, bringing forward new challenges. In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for. Specific approaches to address the privacy issue have been developed, as Privacy Enhancing Technologies. However, they frequently cause loss of information, putting forward a crucial trade-off among data quality and privacy. A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties. Both Academia and Industry realized the importance of evaluating synthetic data quality: without all-round reliable metrics, the innovative data generation task has no proper objective function to maximize. Despite that, the topic remains under-explored. For this reason, we systematically catalog the important traits of synthetic data quality and privacy, and devise a specific methodology to test them. The result is DAISYnt (aDoption of Artificial Intelligence SYnthesis): a comprehensive suite of advanced tests, which sets a de facto standard for synthetic data evaluation. As a practical use-case, a variety of generative algorithms have been trained on real-world Credit Bureau Data. The best model has been assessed, using DAISYnt on the different synthetic replicas. Further potential uses, among others, entail auditing and fine-tuning of generative models or ensuring high quality of a given synthetic dataset. From a prescriptive viewpoint, eventually, DAISYnt may pave the way to synthetic data adoption in highly regulated domains, ranging from Finance to Healthcare, through Insurance and Education.

Related papers

Automatic Prompt Optimization Techniques: Exploring the Potential for Synthetic Data Generation [0.0]
In specialized domains such as healthcare, data acquisition faces significant constraints due to privacy regulations, ethical considerations, and limited availability. The emergence of large-scale prompt-based models presents new opportunities for synthetic data generation without direct access to protected data. We review recent developments in automatic prompt optimization, following PRISMA guidelines.
arXiv Detail & Related papers (2025-02-05T11:13:03Z)
Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z)
Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z)
KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation [0.0]
Generative Deep Learning models struggle to model discrete and non-Gaussian features that have domain constraints. Generative models create synthetic data that repeats sensitive features, which is a privacy risk. This paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation.
arXiv Detail & Related papers (2024-09-25T19:50:03Z)
Generation of synthetic data using breast cancer dataset and classification with resnet18 [0.0]
Synthetic data is required for a number of reasons, including the constraints of real data, the expense of collecting labeled data, and privacy and security problems. A deep learning model called GAN (Generative Adversarial Networks) has been developed with the intention of generating synthetic data. In this study, the Breast Histopathology dataset was used to generate malignant and negatively labeled synthetic patch images.
arXiv Detail & Related papers (2024-05-25T15:53:27Z)
Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics [0.412484724941528]
Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse. Synthetic data emerges as a potential remedy, offering robust privacy protection. Prior LA research on synthetic data lacks thorough evaluation, essential for assessing the delicate balance between privacy and data utility.
arXiv Detail & Related papers (2024-01-12T20:27:55Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development [0.6906005491572401]
This paper investigates the policies governing the creation, utilization, and dissemination of synthetic data. A well crafted synthetic data policy must strike a balance between privacy concerns and the utility of data.
arXiv Detail & Related papers (2023-08-31T23:18:53Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process. We generate a representative as well as fair version of the UCI Adult census data set. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data [2.538209532048867]
Recent advances in generating synthetic data that allow to add principled ways of protecting privacy are a crucial step in sharing statistical information in a privacy preserving way. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. We develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective.
arXiv Detail & Related papers (2020-04-16T16:24:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.