Related papers: Exploring the Potential of Synthetic Data to Replace Real Data

Exploring the Potential of Synthetic Data to Replace Real Data

URL: http://arxiv.org/abs/2408.14559v1
Date: Mon, 26 Aug 2024 18:20:18 GMT
Title: Exploring the Potential of Synthetic Data to Replace Real Data
Authors: Hyungtae Lee, Yan Zhang, Heesung Kwon, Shuvra S. Bhattacharrya,
Abstract summary: We find that the potential of synthetic data to replace real data varies depending on the number of cross-domain real images and the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $textAP_textt2t$, to evaluate the ability of a cross-domain training set using synthetic data.
Score: 16.89582896061033
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_\text{t2t}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.

Related papers

Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al.<n>We critically examine where exactly synthetic data improves model generalization.<n>Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
What's Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models [1.024113475677323]
We apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data. While the classifier identifies distributional differences, XAI concepts, analyzed through methods like permutation feature importance, partial dependence plots, Shapley values, reveal why synthetic data are distinguishable. This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics.
arXiv Detail & Related papers (2025-04-29T12:10:52Z)
Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
Generate to Discriminate: Expert Routing for Continual Learning [59.71853576559306]
Generate to Discriminate (G2D) is a continual learning method that leverages synthetic data to train a domain-discriminator. We observe that G2D outperforms competitive domain-incremental learning methods on tasks in both vision and language modalities.
arXiv Detail & Related papers (2024-12-22T13:16:28Z)
Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology [0.14980193397844666]
We propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images.
arXiv Detail & Related papers (2024-05-30T08:31:01Z)
Exploring the Impact of Synthetic Data for Aerial-view Human Detection [17.41001388151408]
Aerial-view human detection has a large demand for large-scale data to capture more diverse human appearances. Synthetic data can be a good resource to expand data, but the domain gap with real-world data is the biggest obstacle to its use in training.
arXiv Detail & Related papers (2024-05-24T04:19:48Z)
Massively Annotated Datasets for Assessment of Synthetic and Real Data in Face Recognition [0.2775636978045794]
We study the drift between the performance of models trained on real and synthetic datasets. We conduct studies on the differences between real and synthetic datasets on the attribute set. Interestingly enough, we have verified that while real samples suffice to explain the synthetic distribution, the opposite could not be further from being true.
arXiv Detail & Related papers (2024-04-23T17:10:49Z)
Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. This synthetic data is employed to evaluate the robustness of pretrained segmenters. We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances [76.34037366117234]
We introduce a new dataset called Robot Control Gestures (RoCoG-v2) The dataset is composed of both real and synthetic videos from seven gesture classes. We present results using state-of-the-art action recognition and domain adaptation algorithms.
arXiv Detail & Related papers (2023-03-17T23:23:55Z)
Analysis of Training Object Detection Models with Synthetic Data [0.0]
This paper attempts to provide a holistic overview of how to use synthetic data for object detection. We analyse aspects of generating the data as well as techniques used to train the models. Experiments are validated on real data and benchmarked to models trained on real data.
arXiv Detail & Related papers (2022-11-29T10:21:16Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data. The proposed method is compared with two statistical approaches based on Universal and User-dependent models. Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
Synthetic Data for Model Selection [2.4499092754102874]
We show that synthetic data can be beneficial for model selection. We introduce a novel method to calibrate the synthetic error estimation to fit that of the real domain.
arXiv Detail & Related papers (2021-05-03T09:52:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.