Synthetic Dataset Evaluation Based on Generalized Cross Validation
- URL: http://arxiv.org/abs/2509.11273v1
- Date: Sun, 14 Sep 2025 13:57:33 GMT
- Title: Synthetic Dataset Evaluation Based on Generalized Cross Validation
- Authors: Zhihang Song, Dingyi Yao, Ruibo Ming, Lihui Peng, Danya Yao, Yi Zhang,
- Abstract summary: Current evaluation studies for synthetic datasets remain limited, lacking a universally accepted standard framework.<n>This paper proposes a novel evaluation framework integrating generalized cross-validation experiments and domain transfer learning principles.<n>One measures the simulation quality by quantifying the similarity between synthetic data and real-world datasets, while another evaluates the transfer quality by assessing the diversity and coverage of synthetic data.
- Score: 6.672552664633057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of synthetic dataset generation techniques, evaluating the quality of synthetic data has become a critical research focus. Robust evaluation not only drives innovations in data generation methods but also guides researchers in optimizing the utilization of these synthetic resources. However, current evaluation studies for synthetic datasets remain limited, lacking a universally accepted standard framework. To address this, this paper proposes a novel evaluation framework integrating generalized cross-validation experiments and domain transfer learning principles, enabling generalizable and comparable assessments of synthetic dataset quality. The framework involves training task-specific models (e.g., YOLOv5s) on both synthetic datasets and multiple real-world benchmarks (e.g., KITTI, BDD100K), forming a cross-performance matrix. Following normalization, a Generalized Cross-Validation (GCV) Matrix is constructed to quantify domain transferability. The framework introduces two key metrics. One measures the simulation quality by quantifying the similarity between synthetic data and real-world datasets, while another evaluates the transfer quality by assessing the diversity and coverage of synthetic data across various real-world scenarios. Experimental validation on Virtual KITTI demonstrates the effectiveness of our proposed framework and metrics in assessing synthetic data fidelity. This scalable and quantifiable evaluation solution overcomes traditional limitations, providing a principled approach to guide synthetic dataset optimization in artificial intelligence research.
Related papers
- The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data [25.926467401802046]
Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities.<n>We propose a framework for evaluating synthetic data from two dimensions: quality and trustworthiness.
arXiv Detail & Related papers (2026-01-25T06:40:25Z) - Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions.<n>Real-world medical datasets are often difficult to access due to regulatory barriers.<n>We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z) - Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data [53.040873127309766]
We propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning.<n>Our method outperforms existing models on both in-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2025-09-08T17:58:06Z) - Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al.<n>We critically examine where exactly synthetic data improves model generalization.<n>Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z) - Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework [0.4874819476581695]
evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research.<n>We present a framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy.
arXiv Detail & Related papers (2025-04-02T17:10:30Z) - Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.<n>Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - A Study on Improving Realism of Synthetic Data for Machine Learning [6.806559012493756]
This work aims to train and evaluate a synthetic-to-real generative model that transforms the synthetic renderings into more realistic styles on general-purpose datasets conditioned with unlabeled real-world data.
arXiv Detail & Related papers (2023-04-24T21:41:54Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - On the use of automatically generated synthetic image datasets for
benchmarking face recognition [2.0196229393131726]
Recent advances in Generative Adversarial Networks (GANs) provide a pathway to replace real datasets by synthetic datasets.
Recent advances in Generative Adversarial Networks (GANs) to synthesize realistic face images provide a pathway to replace real datasets by synthetic datasets.
benchmarking results on the synthetic dataset are a good substitution, often providing error rates and system ranking similar to the benchmarking on the real dataset.
arXiv Detail & Related papers (2021-06-08T09:54:02Z) - Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic
Data [0.0]
AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing.
We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
arXiv Detail & Related papers (2021-04-01T17:30:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.