A Study on Improving Realism of Synthetic Data for Machine Learning
- URL: http://arxiv.org/abs/2304.12463v2
- Date: Fri, 28 Apr 2023 22:29:31 GMT
- Title: A Study on Improving Realism of Synthetic Data for Machine Learning
- Authors: Tingwei Shen, Ganning Zhao, Suya You
- Abstract summary: This work aims to train and evaluate a synthetic-to-real generative model that transforms the synthetic renderings into more realistic styles on general-purpose datasets conditioned with unlabeled real-world data.
- Score: 6.806559012493756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic-to-real data translation using generative adversarial learning has
achieved significant success in improving synthetic data. Yet, limited studies
focus on deep evaluation and comparison of adversarial training on
general-purpose synthetic data for machine learning. This work aims to train
and evaluate a synthetic-to-real generative model that transforms the synthetic
renderings into more realistic styles on general-purpose datasets conditioned
with unlabeled real-world data. Extensive performance evaluation and comparison
have been conducted through qualitative and quantitative metrics and a defined
downstream perception task.
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - An evaluation framework for synthetic data generation models [3.3748750222488657]
We present a new framework for evaluating synthetic data generation models' ability for developing high-quality synthetic data.
Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generated high quality data.
arXiv Detail & Related papers (2024-04-13T01:16:45Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - On the Equivalency, Substitutability, and Flexibility of Synthetic Data [9.459709213597707]
We investigate the equivalency of synthetic data to real-world data, the substitutability of synthetic data for real data, and the flexibility of synthetic data generators.
Our results suggest that synthetic data not only enhances model performance but also demonstrates substitutability for real data, with 60% to 80% replacement without performance loss.
arXiv Detail & Related papers (2024-03-24T17:21:32Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Synthetic Data Generation with Large Language Models for Text
Classification: Potential and Limitations [21.583825474908334]
We study how the performance of models trained on synthetic data may vary with the subjectivity of classification.
Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data.
arXiv Detail & Related papers (2023-10-11T19:51:13Z) - Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - A Scaling Law for Synthetic-to-Real Transfer: A Measure of Pre-Training [52.93808218720784]
Synthetic-to-real transfer learning is a framework in which we pre-train models with synthetically generated images and ground-truth annotations for real tasks.
Although synthetic images overcome the data scarcity issue, it remains unclear how the fine-tuning performance scales with pre-trained models.
We observe a simple and general scaling law that consistently describes learning curves in various tasks, models, and complexities of synthesized pre-training data.
arXiv Detail & Related papers (2021-08-25T02:29:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.