Related papers: A Style-Based Profiling Framework for Quantifying the Synthetic-to-Real Gap in Autonomous Driving Datasets

URL: http://arxiv.org/abs/2510.10203v2
Date: Thu, 23 Oct 2025 08:49:56 GMT
Title: A Style-Based Profiling Framework for Quantifying the Synthetic-to-Real Gap in Autonomous Driving Datasets
Authors: Dingyi Yao, Xinyao Han, Ruibo Ming, Zhihang Song, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang
Abstract summary: This paper introduces a profile extraction and discovery framework for characterizing the style profiles underlying both synthetic and real image datasets. Our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation to extract style embeddings.
Score: 9.788200709163064
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ensuring the reliability of autonomous driving perception systems requires extensive environment-based testing, yet real-world execution is often impractical. Synthetic datasets have therefore emerged as a promising alternative, offering advantages such as cost-effectiveness, bias-free labeling, and controllable scenarios. However, the domain gap between synthetic and real-world datasets remains a major obstacle to model generalization. To address this challenge from a data-centric perspective, this paper introduces a profile extraction and discovery framework for characterizing the style profiles underlying both synthetic and real image datasets. We propose Style Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. Our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation to extract style embeddings. Furthermore, we establish a benchmark using publicly available datasets. Experiments are conducted on a variety of datasets and sim-to-real methods, and the results show that our method is capable of quantifying the synthetic-to-real gap. This work provides a standardized profiling-based quality control paradigm that enables systematic diagnosis and targeted enhancement of synthetic datasets, advancing future development of data-driven autonomous driving systems.
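
The abstract mentions Gram matrix-based style extraction without giving details. The following is a minimal sketch of how a Gram matrix is commonly computed from a flattened feature map in style-transfer work; it is not the authors' implementation. The function name `gram_matrix`, the toy input, and the normalization by the number of spatial positions are all assumptions made for illustration.

```python
from typing import List


def gram_matrix(features: List[List[float]]) -> List[List[float]]:
    """Channel-by-channel Gram matrix of a flattened feature map.

    `features` holds C rows (one per channel), each with N = H*W activations.
    Entry (i, j) is the inner product of channels i and j divided by N.
    Dividing by N is one common normalization and is an assumption here.
    """
    if not features:
        return []
    n = len(features[0])
    if n == 0 or any(len(row) != n for row in features):
        raise ValueError("all channels must have the same non-zero length")
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in features]
        for row_i in features
    ]


# Hypothetical toy feature map: 2 channels, 4 spatial positions.
toy = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
]
G = gram_matrix(toy)
print(G)  # [[0.5, 0.0], [0.0, 2.0]]
```

The matrix discards spatial layout and keeps only channel co-activation statistics, which is why it is often used as a style descriptor. How the paper turns such descriptors into style embeddings and into the SEDD metric is not specified in this listing.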

Related papers

A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport [7.409483754602669]
Synthetic data offers a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. We propose a framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels. Results show that synthetic data do not inherently guarantee privacy and there is no "one-size-fits-all" model.
arXiv Detail & Related papers (2025-10-28T12:52:47Z)
Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
A Synthetic Dataset for Manometry Recognition in Robotic Applications [0.686108371431346]
We propose a hybrid data synthesis pipeline that integrates procedural rendering and AI-driven video generation. A YOLO-based detector trained on a composite dataset, combining real and synthetic data, outperformed models trained solely on real images.
arXiv Detail & Related papers (2025-08-24T17:52:13Z)
Evaluating the Impact of Synthetic Data on Object Detection Tasks in Autonomous Driving [0.0]
We compare 2D and 3D object detection tasks trained on real, synthetic, and mixed datasets. Our findings demonstrate that the use of a combination of real and synthetic data improves the robustness and generalization of object detection models.
arXiv Detail & Related papers (2025-03-12T20:13:33Z)
Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric. Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets. We show that our metric is an effective way to rank synthetic images based on their usability.
arXiv Detail & Related papers (2024-12-06T23:36:36Z)
Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models. We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent. The framework is available as a free open Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z)
Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. This synthetic data is employed to evaluate the robustness of pretrained segmenters. We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future [130.87142103774752]
This review systematically assesses over seventy open-source autonomous driving datasets. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets. It also delves into the scientific and technical challenges that warrant resolution.
arXiv Detail & Related papers (2023-12-06T10:46:53Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
Style-Hallucinated Dual Consistency Learning for Domain Generalized Semantic Segmentation [117.3856882511919]
We propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle domain shift. Our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.07% and 8.35% on the average mIoU of three real-world datasets.
arXiv Detail & Related papers (2022-04-06T02:49:06Z)
CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE). At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales. We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
Sim-to-Real Domain Adaptation for Lane Detection and Classification in Autonomous Driving [0.0]
Unsupervised Domain Adaptation (UDA) approaches are considered low-cost and less time-consuming. We propose UDA schemes using adversarial discriminative and generative methods for lane detection and classification applications in autonomous driving.
arXiv Detail & Related papers (2022-02-15T02:10:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.