Knowing the Distance: Understanding the Gap Between Synthetic and Real
Data For Face Parsing
- URL: http://arxiv.org/abs/2303.15219v1
- Date: Mon, 27 Mar 2023 13:59:26 GMT
- Title: Knowing the Distance: Understanding the Gap Between Synthetic and Real
Data For Face Parsing
- Authors: Eli Friedman, Assaf Lehr, Alexey Gruzdev, Vladimir Loginov, Max Kogan,
Moran Rubin, Orly Zvitia
- Abstract summary: We show that the distribution gap is the largest contributor to the performance gap, accounting for over 50% of the gap.
This suggests that synthetic data is a viable alternative to real data, especially when real data is limited or difficult to obtain.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The use of synthetic data for training computer vision algorithms has become
increasingly popular due to its cost-effectiveness, scalability, and ability to
provide accurate multi-modality labels. Although recent studies have
demonstrated impressive results when training networks solely on synthetic
data, there remains a performance gap between synthetic and real data that is
commonly attributed to lack of photorealism. The aim of this study is to
investigate the gap in greater detail for the face parsing task. We
differentiate between three types of gaps: distribution gap, label gap, and
photorealism gap. Our findings show that the distribution gap is the largest
contributor to the performance gap, accounting for over 50% of the gap. By
addressing this gap and accounting for the label gap, we demonstrate that a
model trained on synthetic data achieves comparable results to one trained on a
similar amount of real data. This suggests that synthetic data is a viable
alternative to real data, especially when real data is limited or difficult to
obtain. Our study highlights the importance of content diversity in synthetic
datasets and challenges the notion that the photorealism gap is the most
critical factor affecting the performance of computer vision models trained on
synthetic data.
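The abstract's claim that the distribution gap "accounts for over 50% of the gap" can be made concrete with a small numeric sketch. The scores below are purely illustrative placeholders (not taken from the paper): one model trained on real data, one on raw synthetic data, and one on synthetic data whose content distribution was matched to the real set.

```python
# Hypothetical mIoU scores from controlled face-parsing experiments.
# All numbers are illustrative, not results from the paper.
miou_real = 0.85            # trained on real data
miou_synth = 0.75           # trained on raw synthetic data
miou_synth_matched = 0.81   # synthetic with real-matched content distribution

total_gap = miou_real - miou_synth
distribution_contrib = (miou_synth_matched - miou_synth) / total_gap
print(f"Distribution gap contribution: {distribution_contrib:.0%}")
# prints: Distribution gap contribution: 60%
```

Under these made-up numbers, closing the distribution gap alone recovers 60% of the real-vs-synthetic performance gap, which is the kind of decomposition the study performs.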
Related papers
- Synthetic Simplicity: Unveiling Bias in Medical Data Augmentation [0.7499722271664144]
Synthetic data is becoming increasingly integral in data-scarce fields such as medical imaging.
Downstream neural networks often exploit spurious distinctions between real and synthetic data when there is a strong correlation between the data source and the task label.
This exploitation manifests as *simplicity bias*, where models overly rely on superficial features rather than genuine task-related complexities.
arXiv Detail & Related papers (2024-07-31T15:14:17Z)
- Exploring the Impact of Synthetic Data for Aerial-view Human Detection [17.41001388151408]
Aerial-view human detection demands large-scale data to capture more diverse human appearances.
Synthetic data can be a good resource to expand data, but the domain gap with real-world data is the biggest obstacle to its use in training.
arXiv Detail & Related papers (2024-05-24T04:19:48Z)
- Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data [2.6016285265085526]
Student models show a significant drop in accuracy compared to models trained on real data.
By retraining individual layers on either real or synthetic data, we reveal that the drop mainly stems from the model's final layers.
Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy.
arXiv Detail & Related papers (2024-05-06T07:51:13Z)
- Massively Annotated Datasets for Assessment of Synthetic and Real Data in Face Recognition [0.2775636978045794]
We study the performance drift between models trained on real and synthetic datasets.
We also analyze the differences between real and synthetic datasets across the attribute set.
Interestingly, we verified that while real samples suffice to explain the synthetic distribution, the reverse does not hold.
arXiv Detail & Related papers (2024-04-23T17:10:49Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Object Detector Differences when using Synthetic and Real Training Data [0.0]
We train the YOLOv3 object detector on real and synthetic images from city environments.
We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis.
The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part.
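The layer-wise similarity analysis above uses Centered Kernel Alignment. A minimal NumPy sketch of the standard linear-kernel CKA formulation (variable names illustrative; the paper may use a different kernel or batching scheme):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices.

    X, Y: (n_samples, n_features) activations from two layers or models,
    extracted for the same set of input samples.
    Returns a similarity in [0, 1]; 1 means identical representations
    up to orthogonal transformation and isotropic scaling.
    """
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based similarity for linear kernels
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Comparing `linear_cka` scores layer by layer between a detector trained on real data and one trained on synthetic data reveals where the representations diverge, which is how such studies localize differences to early layers versus the head.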
arXiv Detail & Related papers (2023-12-01T16:27:48Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances [76.34037366117234]
We introduce a new dataset called Robot Control Gestures (RoCoG-v2)
The dataset is composed of both real and synthetic videos from seven gesture classes.
We present results using state-of-the-art action recognition and domain adaptation algorithms.
arXiv Detail & Related papers (2023-03-17T23:23:55Z)
- Synthetic Data for Object Classification in Industrial Applications [53.180678723280145]
In object classification, capturing a large number of images per object and in different conditions is not always possible.
This work explores the creation of artificial images using a game engine to cope with limited data in the training dataset.
arXiv Detail & Related papers (2022-12-09T11:43:04Z)
- Beyond spectral gap: The role of the topology in decentralized learning [58.48291921602417]
In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model.
This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution.
Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
arXiv Detail & Related papers (2022-06-07T08:19:06Z)
- Fairness in Semi-supervised Learning: Unlabeled Data Help to Reduce Discrimination [53.3082498402884]
A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair.
We present a framework of fair semi-supervised learning in the pre-processing phase, including pseudo labeling to predict labels for unlabeled data.
A theoretical decomposition analysis of bias, variance and noise highlights the different sources of discrimination and the impact they have on fairness in semi-supervised learning.
arXiv Detail & Related papers (2020-09-25T05:48:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.