Permissioned Blockchain-based Framework for Ranking Synthetic Data Generators
- URL: http://arxiv.org/abs/2405.07196v1
- Date: Sun, 12 May 2024 07:46:00 GMT
- Title: Permissioned Blockchain-based Framework for Ranking Synthetic Data Generators
- Authors: Narasimha Raghavan Veeraragavan, Mohammad Hossein Tabatabaei, Severin Elvatun, Vibeke Binz Vallevik, Siri Larønningen, Jan F Nygård,
- Abstract summary: We introduce a novel approach in which the proposed ranking algorithm is implemented as a smart contract within a permissioned blockchain framework called Sawtooth.
Our framework demonstrates its effectiveness in providing nuanced rankings that consider both desirable and undesirable properties.
- Score: 0.5541644538483947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic data generation is increasingly recognized as a crucial solution to address data related challenges such as scarcity, bias, and privacy concerns. As synthetic data proliferates, the need for a robust evaluation framework to select a synthetic data generator becomes more pressing given the variety of options available. In this research study, we investigate two primary questions: 1) How can we select the most suitable synthetic data generator from a set of options for a specific purpose? 2) How can we make the selection process more transparent, accountable, and auditable? To address these questions, we introduce a novel approach in which the proposed ranking algorithm is implemented as a smart contract within a permissioned blockchain framework called Sawtooth. Through comprehensive experiments and comparisons with state-of-the-art baseline ranking solutions, our framework demonstrates its effectiveness in providing nuanced rankings that consider both desirable and undesirable properties. Furthermore, our framework serves as a valuable tool for selecting the optimal synthetic data generators for specific needs while ensuring compliance with data protection principles.
Related papers
- Data Generation via Latent Factor Simulation for Fairness-aware Re-ranking [11.133319460036082]
Synthetic data is a useful resource for algorithmic research.
We propose a novel type of data for fairness-aware recommendation: synthetic recommender system outputs.
arXiv Detail & Related papers (2024-09-21T09:13:50Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - A supervised generative optimization approach for tabular data [2.5311562666866494]
This work presents a novel synthetic data generation framework.
It integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution of existing synthetic distributions.
arXiv Detail & Related papers (2023-09-10T16:56:46Z) - Post-processing Private Synthetic Data for Improving Utility on Selected
Measures [7.371282202708775]
We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user.
Our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
arXiv Detail & Related papers (2023-05-24T19:49:50Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Enabling Synthetic Data adoption in regulated domains [1.9512796489908306]
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties.
arXiv Detail & Related papers (2022-04-13T10:53:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.