Related papers: Data-Centric Visual Development for Self-Driving Labs

Data-Centric Visual Development for Self-Driving Labs

URL: http://arxiv.org/abs/2512.02018v1
Date: Mon, 01 Dec 2025 18:59:57 GMT
Title: Data-Centric Visual Development for Self-Driving Labs
Authors: Anbang Liu, Guanzhong Hu, Jiayi Wang, Ping Guo, Han Liu,
Abstract summary: We focus on pipetting, the most critical and precision sensitive action in SDLs.<n>We build a hybrid pipeline that fuses real and virtual data generation.<n>On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy.
Score: 11.559239027724884
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.

Related papers

Diffusion-Based Generation and Imputation of Driving Scenarios from Limited Vehicle CAN Data [13.575299934411978]
Diffusion models have shown to be effective to generate realistic and synthetic data.<n>We propose a hybrid generative approach that combines autoregressive and non-autoregressive techniques.<n>Our best model is able to outperform even the training data in terms of physical correctness, while showing plausible driving behavior.
arXiv Detail & Related papers (2025-09-15T19:07:28Z)
Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection.<n>In offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty.<n>During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training [0.708987965338602]
We propose a novel method for automatically generating annotated synthetic data in Unreal Engine.<n>We demonstrate that synthetic datasets can achieve performance comparable to that of real-world datasets.<n>This is the first application of synthetic data for training object detection algorithms in robot soccer.
arXiv Detail & Related papers (2025-06-05T14:37:40Z)
Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets [4.696575161583618]
This study focuses on camera-based traffic sign recognition applications for advanced driver assistance systems and autonomous driving. The proposed augmentation pipeline of synthetic datasets includes novel augmentation processes such as structured shadows and gaussian specular highlights. Experiments showed that a synthetic image-based approach outperforms in most cases real image-based training when applied to cross-domain test datasets.
arXiv Detail & Related papers (2024-10-30T07:11:41Z)
Learning from Imperfect Demonstrations with Self-Supervision for Robotic Manipulation [30.791222277450053]
Current imitation learning (IL) typically discards imperfect data, focusing solely on successful expert data.<n>We introduce a Self-Supervised Data Filtering framework (SSDF) that combines expert and imperfect data to compute quality scores for failed trajectory segments.<n>SSDF can accurately expand the training dataset with high-quality imperfect data and improve the success rates for all robotic manipulation tasks.
arXiv Detail & Related papers (2024-01-17T04:15:56Z)
A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success with the price of large-scale labeled training data. The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimize our distilled data to guide networks to a similar state as those trained on real data. Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data. Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
arXiv Detail & Related papers (2022-03-22T17:58:59Z)
AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective. We demonstrate that the proposed method finds the optimal data distribution faster (up to $50times$), with significantly reduced training data generation (up to $30times$) and better accuracy ($+8.7%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
Deep Traffic Sign Detection and Recognition Without Target Domain Real Images [52.079665469286496]
We propose a novel database generation method that requires no real image from the target-domain, and (ii) templates of the traffic signs. The method does not aim at overcoming the training with real data, but to be a compatible alternative when the real data is not available. On large data sets, training with a fully synthetic data set almost matches the performance of training with a real one.
arXiv Detail & Related papers (2020-07-30T21:06:47Z)
Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting. We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.