Related papers: Rule-adhering synthetic data -- the lingua franca of learning

URL: http://arxiv.org/abs/2209.06679v1
Date: Mon, 12 Sep 2022 20:01:13 GMT
Title: Rule-adhering synthetic data -- the lingua franca of learning
Authors: Michael Platzer and Ivona Krchova
Abstract summary: In this work we explore approaches of incorporating domain expertise into the data synthesis. The resulting synthetic data generator can be probed for any number of new samples. We demonstrate the concept for a publicly available data set, and evaluate its benefits via descriptive analysis as well as a downstream ML model.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI-generated synthetic data allows to distill the general patterns of existing data, that can then be shared safely as granular-level representative, yet novel data samples within the original semantics. In this work we explore approaches of incorporating domain expertise into the data synthesis, to have the statistical properties as well as pre-existing domain knowledge of rules be represented. The resulting synthetic data generator, that can be probed for any number of new samples, can then serve as a common source of intelligence, as a lingua franca of learning, consumable by humans and machines alike. We demonstrate the concept for a publicly available data set, and evaluate its benefits via descriptive analysis as well as a downstream ML model.

Related papers

Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice. Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z)
Data-Constrained Synthesis of Training Data for De-Identification [0.0]
We domain-adapt large language models (LLMs) to the clinical domain. We generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information. The synthetic corpora are then used to train synthetic NER models.
arXiv Detail & Related papers (2025-02-20T16:09:27Z)
Exploring the Potential of Synthetic Data to Replace Real Data [16.89582896061033]
We find that the potential of synthetic data to replace real data varies depending on the number of cross-domain real images and the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_{\text{t2t}}$, to evaluate the ability of a cross-domain training set using synthetic data.
arXiv Detail & Related papers (2024-08-26T18:20:18Z)
Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data [0.0]
Synthetic Data is not new, but recent advances in Generative AI have raised interest in expanding the research toolbox. This article provides a taxonomy of the full breadth of the Synthetic Data domain.
arXiv Detail & Related papers (2024-08-10T16:46:35Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data. In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process. We generate a representative as well as fair version of the UCI Adult census data set. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
Improving Text Relationship Modeling with Artificial Data [0.07614628596146598]
We apply and evaluate a synthetic data approach to relationship classification in digital libraries. We find that for classification on whole-part relationships between books, synthetic data improves a deep neural network classifier by 91%.
arXiv Detail & Related papers (2020-10-27T22:04:54Z)
Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset. With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset. In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
Assembling Semantically-Disentangled Representations for Predictive-Generative Models via Adaptation from Synthetic Domain [32.42156485883356]
We show that semantically-aligned representations can be generated with the help of a physics based engine. It is shown that the proposed (SYNTH-VAE-GAN) method can construct a conditional-generative model of human face attributes without relying on real data labels.
arXiv Detail & Related papers (2020-02-23T03:35:45Z)
