Related papers: Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data

Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data

URL: http://arxiv.org/abs/2406.19175v1
Date: Thu, 27 Jun 2024 13:51:53 GMT
Title: Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data
Authors: Lukas Malte Kemeter, Rasmus Hvingelby, Paulina Sierak, Tobias Schön, Bishwajit Gosswam,
Abstract summary: In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels.
Score: 0.04194295877935867
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle - if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.

Related papers

The Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data [1.853053680967785]
This work investigates the impact of synthetic data on the performance of object detection models, compared to models trained on real-world data only.<n>It comprises experiments focused on pallet detection in a warehouse setting, utilizing both real and various synthetic dataset generation strategies.
arXiv Detail & Related papers (2025-10-14T06:59:51Z)
Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al.<n>We critically examine where exactly synthetic data improves model generalization.<n>Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training [0.708987965338602]
We propose a novel method for automatically generating annotated synthetic data in Unreal Engine.<n>We demonstrate that synthetic datasets can achieve performance comparable to that of real-world datasets.<n>This is the first application of synthetic data for training object detection algorithms in robot soccer.
arXiv Detail & Related papers (2025-06-05T14:37:40Z)
Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map [50.21082069320818]
We propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision.<n>Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks.<n>Results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data.
arXiv Detail & Related papers (2025-05-06T15:21:36Z)
Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets [4.696575161583618]
This study focuses on camera-based traffic sign recognition applications for advanced driver assistance systems and autonomous driving. The proposed augmentation pipeline of synthetic datasets includes novel augmentation processes such as structured shadows and gaussian specular highlights. Experiments showed that a synthetic image-based approach outperforms in most cases real image-based training when applied to cross-domain test datasets.
arXiv Detail & Related papers (2024-10-30T07:11:41Z)
A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs [13.684959490938269]
We propose learning from synthetic data for detecting real-world multimodal misinformation through two model-agnostic data selection methods. Experiments show that our method enhances the performance of a small MLLM on real-world fact-checking datasets.
arXiv Detail & Related papers (2024-09-29T11:01:14Z)
Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology [0.14980193397844666]
We propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images.
arXiv Detail & Related papers (2024-05-30T08:31:01Z)
Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data. It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection [2.7648976108201815]
We propose to use a Generative Adversarial Network (GAN) to close the gap between the real and synthetic data. Our approach not only produces visually plausible samples but also does not require any labels of the real domain.
arXiv Detail & Related papers (2023-07-21T05:26:32Z)
A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success with the price of large-scale labeled training data. The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
Synthetic Data for Object Classification in Industrial Applications [53.180678723280145]
In object classification, capturing a large number of images per object and in different conditions is not always possible. This work explores the creation of artificial images using a game engine to cope with limited data in the training dataset.
arXiv Detail & Related papers (2022-12-09T11:43:04Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective. We demonstrate that the proposed method finds the optimal data distribution faster (up to $50times$), with significantly reduced training data generation (up to $30times$) and better accuracy ($+8.7%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.