Rule-adhering synthetic data -- the lingua franca of learning
- URL: http://arxiv.org/abs/2209.06679v1
- Date: Mon, 12 Sep 2022 20:01:13 GMT
- Title: Rule-adhering synthetic data -- the lingua franca of learning
- Authors: Michael Platzer and Ivona Krchova
- Abstract summary: In this work we explore approaches to incorporating domain expertise into the data synthesis.
The resulting synthetic data generator can be probed for any number of new samples.
We demonstrate the concept on a publicly available data set and evaluate its benefits via descriptive analysis as well as a downstream ML model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-generated synthetic data makes it possible to distill the general
patterns of existing data, which can then be shared safely as granular-level,
representative, yet novel data samples within the original semantics. In this
work we explore approaches to incorporating domain expertise into the data
synthesis, so that both the statistical properties and pre-existing domain
knowledge of rules are represented. The resulting synthetic data generator,
which can be probed for any number of new samples, can then serve as a common
source of intelligence, a lingua franca of learning, consumable by humans and
machines alike. We demonstrate the concept on a publicly available data set and
evaluate its benefits via descriptive analysis as well as a downstream ML model.
Related papers
- Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data [0.0]
Synthetic Data is not new, but recent advances in Generative AI have raised interest in expanding the research toolbox.
This article provides a taxonomy of the full breadth of the Synthetic Data domain.
arXiv Detail & Related papers (2024-08-10T16:46:35Z)
- Preserving correlations: A statistical method for generating synthetic data [0.0]
We propose a method to generate statistically representative synthetic data.
The main goal is to be able to maintain in the synthetic dataset the correlations of the features present in the original one.
We describe in detail our algorithm used both for the analysis of the original dataset and for the generation of the synthetic data points.
arXiv Detail & Related papers (2024-03-03T10:35:46Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Improving Text Relationship Modeling with Artificial Data [0.07614628596146598]
We apply and evaluate a synthetic data approach to relationship classification in digital libraries.
We find that for classification on whole-part relationships between books, synthetic data improves a deep neural network classifier by 91%.
arXiv Detail & Related papers (2020-10-27T22:04:54Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
- Assembling Semantically-Disentangled Representations for Predictive-Generative Models via Adaptation from Synthetic Domain [32.42156485883356]
We show that semantically-aligned representations can be generated with the help of a physics based engine.
It is shown that the proposed (SYNTH-VAE-GAN) method can construct a conditional-generative model of human face attributes without relying on real data labels.
arXiv Detail & Related papers (2020-02-23T03:35:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.