Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data
- URL: http://arxiv.org/abs/2406.19175v1
- Date: Thu, 27 Jun 2024 13:51:53 GMT
- Title: Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data
- Authors: Lukas Malte Kemeter, Rasmus Hvingelby, Paulina Sierak, Tobias Schön, Bishwajit Gosswam,
- Abstract summary: In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost.
Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data.
We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels.
- Score: 0.04194295877935867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.
Related papers
- Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology [0.14980193397844666]
We propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data.
Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images.
arXiv Detail & Related papers (2024-05-30T08:31:01Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection [2.7648976108201815]
We propose to use a Generative Adversarial Network (GAN) to close the gap between the real and synthetic data.
Our approach not only produces visually plausible samples but also does not require any labels of the real domain.
arXiv Detail & Related papers (2023-07-21T05:26:32Z)
- A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
- One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z)
- Synthetic Data for Object Classification in Industrial Applications [53.180678723280145]
In object classification, capturing a large number of images per object and in different conditions is not always possible.
This work explores the creation of artificial images using a game engine to cope with limited data in the training dataset.
arXiv Detail & Related papers (2022-12-09T11:43:04Z)
- Analysis of Training Object Detection Models with Synthetic Data [0.0]
This paper attempts to provide a holistic overview of how to use synthetic data for object detection.
We analyse aspects of generating the data as well as techniques used to train the models.
Experiments are validated on real data and benchmarked to models trained on real data.
arXiv Detail & Related papers (2022-11-29T10:21:16Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Semi-synthesis: A fast way to produce effective datasets for stereo matching [16.602343511350252]
Close-to-real-scene texture rendering is a key factor in boosting stereo matching performance.
We propose semi-synthetic, an effective and fast way to synthesize large amounts of data with close-to-real-scene texture.
With further fine-tuning on the real dataset, we also achieve SOTA performance on Middlebury and competitive results on KITTI and ETH3D datasets.
arXiv Detail & Related papers (2021-01-26T14:34:49Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.