Synthcity: facilitating innovative use cases of synthetic data in
different data modalities
- URL: http://arxiv.org/abs/2301.07573v1
- Date: Wed, 18 Jan 2023 14:49:54 GMT
- Title: Synthcity: facilitating innovative use cases of synthetic data in
different data modalities
- Authors: Zhaozhi Qian, Bogdan-Constantin Cebere, Mihaela van der Schaar
- Abstract summary: Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data.
- Score: 86.52703093858631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthcity is an open-source software package for innovative use cases of
synthetic data in ML fairness, privacy and augmentation across diverse tabular
data modalities, including static data, regular and irregular time series, data
with censoring, multi-source data, composite data, and more. Synthcity provides
the practitioners with a single access point to cutting edge research and tools
in synthetic data. It also offers the community a playground for rapid
experimentation and prototyping, a one-stop-shop for SOTA benchmarks, and an
opportunity for extending research impact. The library can be accessed on
GitHub (https://github.com/vanderschaarlab/synthcity) and pip
(https://pypi.org/project/synthcity/). We warmly invite the community to join
the development effort by providing feedback, reporting bugs, and contributing
code.
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment [39.137060714048175]
We argue that enhancing diversity can improve the parallelizable yet isolated approach to synthesizing datasets.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
arXiv Detail & Related papers (2024-09-26T08:03:19Z)
- Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score.
Our proposed random projection based framework can generate synthetic data with the highest accuracy score and scales the fastest.
arXiv Detail & Related papers (2023-12-09T02:04:25Z)
- On the Usefulness of Synthetic Tabular Data Generation [3.04585143845864]
It is commonly believed that synthetic data can be used for both data exchange and boosting machine learning (ML) training.
Privacy-preserving synthetic data generation can accelerate data exchange for downstream tasks, but there is not enough evidence to show how or why synthetic data can boost ML training.
arXiv Detail & Related papers (2023-06-27T17:26:23Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data [14.900342838726747]
We propose a new universal metric, TabSynDex, for robust evaluation of synthetic data.
Being a single score metric, TabSynDex can also be used to observe and evaluate the training of neural network based approaches.
arXiv Detail & Related papers (2022-07-12T04:08:11Z)
- FedSyn: Synthetic Data Generation using Federated Learning [0.0]
Current Machine Learning practices can be leveraged to generate synthetic data from an existing dataset.
There are data privacy concerns that some institutions may not be comfortable with.
This paper proposes a novel approach to generate synthetic data - FedSyn.
arXiv Detail & Related papers (2022-03-11T14:05:37Z)
- Shape of synth to come: Why we should use synthetic data for English surface realization [72.62356061765976]
In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data.
We show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect.
We argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.
arXiv Detail & Related papers (2020-05-06T10:00:55Z)