SynSpill: Improved Industrial Spill Detection With Synthetic Data
- URL: http://arxiv.org/abs/2508.10171v1
- Date: Wed, 13 Aug 2025 20:09:58 GMT
- Title: SynSpill: Improved Industrial Spill Detection With Synthetic Data
- Authors: Aaditya Baranwal, Abdul Mueez, Jason Voelker, Guneet Bhatia, Shruti Vyas,
- Abstract summary: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. Their performance degrades significantly in niche, safety-critical domains such as industrial spill detection. We introduce a scalable framework centered on a high-quality synthetic data generation pipeline.
- Score: 3.297182592932918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity -- driven by privacy concerns, data sensitivity, and the infrequency of real incidents -- renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app
Related papers
- SynSacc: A Blender-to-V2E Pipeline for Synthetic Neuromorphic Eye-Movement Data and Sim-to-Real Spiking Model Training [6.113940256355538]
  We introduce a synthetic dataset generated with Blender to simulate saccades and fixations under controlled conditions. We evaluate its robustness by training two architectures and finetuning on real event data. The proposed models achieve up to 0.83 accuracy and maintain consistent performance across varying temporal resolutions.
  (2026-02-09T14:34:31Z)
- Adapting Web Agents with Synthetic Supervision [80.89365133130558]
  Web agents struggle to adapt to new websites due to the scarcity of environment specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge. We propose SynthAgent, a fully synthetic supervision framework.
  (2025-11-08T18:45:33Z)
- Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
  We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
  (2025-09-07T19:28:52Z)
- A Synthetic Dataset for Manometry Recognition in Robotic Applications [0.686108371431346]
  We propose a hybrid data synthesis pipeline that integrates procedural rendering and AI-driven video generation. A YOLO-based detector trained on a composite dataset, combining real and synthetic data, outperformed models trained solely on real images.
  (2025-08-24T17:52:13Z)
- AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection [3.5912245880418125]
  Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. This study evaluates text-to-motion and text-to-text models in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance.
  (2025-05-07T02:30:33Z)
- Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map [50.21082069320818]
  We propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks. Results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data.
  (2025-05-06T15:21:36Z)
- Evaluating the Impact of Synthetic Data on Object Detection Tasks in Autonomous Driving [0.0]
  We compare 2D and 3D object detection tasks trained on real, synthetic, and mixed datasets. Our findings demonstrate that the use of a combination of real and synthetic data improves the robustness and generalization of object detection models.
  (2025-03-12T20:13:33Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
  Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
  (2024-06-18T08:38:59Z)
- Instance-Level Safety-Aware Fidelity of Synthetic Data and Its Calibration [5.089356301032639]
  We focus on its role in safety-critical applications, introducing four types of instance-level fidelity. The aim is to ensure that applying testing on synthetic data can reveal real-world safety issues.
  (2024-02-10T19:45:40Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
  We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. This synthetic data is employed to evaluate the robustness of pretrained segmenters. We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
  (2023-12-14T18:56:07Z)
- UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception [62.71374902455154]
  We leverage recent advancements in neural rendering to improve static and dynamic novel-view UAV-based image rendering. We demonstrate a considerable performance boost when a state-of-the-art detection model is optimized primarily on hybrid sets of real and synthetic data.
  (2023-10-25T00:20:37Z)
- PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning [54.409634256153154]
  In Reinforcement Learning (RL), enhancing sample efficiency is crucial. In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction. Our study investigates the underlying causes of this phenomenon by dividing plasticity into two aspects.
  (2023-06-19T06:14:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.