Is Synthetic Dataset Reliable for Benchmarking Generalizable Person
Re-Identification?
- URL: http://arxiv.org/abs/2209.05047v1
- Date: Mon, 12 Sep 2022 06:54:54 GMT
- Title: Is Synthetic Dataset Reliable for Benchmarking Generalizable Person
Re-Identification?
- Authors: Cuicui Kang
- Abstract summary: We show that a recent large-scale synthetic dataset, ClonedPerson, can be reliably used to benchmark GPReID, statistically the same as real-world datasets.
This study validates the use of synthetic datasets as both source training sets and target testing sets, with no privacy concerns arising from real-world surveillance data.
- Score: 1.1041211464412568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that models trained on synthetic datasets are able to
achieve better generalizable person re-identification (GPReID) performance than
models trained on public real-world datasets. On the other hand, due to the
limitations of real-world person ReID datasets, it would also be important and
interesting to use large-scale synthetic datasets as test sets to benchmark
person ReID algorithms. Yet this raises a critical question: are synthetic
datasets reliable for benchmarking generalizable person re-identification? The
literature offers no evidence either way. To address this, we design a
method called Pairwise Ranking Analysis (PRA) to quantitatively measure
ranking similarity and to statistically test whether two sets of ranking
correlations follow identical distributions.
Specifically, we employ Kendall rank correlation coefficients to evaluate
pairwise similarity values between algorithm rankings on different datasets.
Then, a non-parametric two-sample Kolmogorov-Smirnov (KS) test is performed to
judge whether the algorithm ranking correlations between synthetic and
real-world datasets and those among real-world datasets alone lie in identical
distributions. We conduct comprehensive experiments with ten representative
algorithms, three popular real-world person ReID datasets, and three recently
released large-scale synthetic datasets. Through the designed pairwise ranking
analysis and comprehensive evaluations, we conclude that a recent large-scale
synthetic dataset, ClonedPerson, can be reliably used to benchmark GPReID,
statistically the same as real-world datasets. Therefore, this study validates
the use of synthetic datasets as both source training sets and target testing
sets, with no privacy concerns arising from real-world surveillance data.
Moreover, this study may also inspire future designs of synthetic datasets.
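For intuition, the PRA procedure described in the abstract reduces to two standard statistical routines. Below is a minimal sketch in Python using SciPy; the Rank-1 scores are invented placeholders for five hypothetical algorithms (the paper evaluates ten algorithms on three real-world and three synthetic datasets), so only the structure of the computation reflects the paper, not the numbers.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau, ks_2samp

# Hypothetical Rank-1 accuracies (%) of five algorithms on each dataset.
# Dataset names follow common GPReID benchmarks; the numbers are invented
# placeholders, not results from the paper.
scores = {
    "Market-1501":  [85.1, 78.3, 80.2, 74.5, 88.0],  # real-world
    "CUHK03":       [60.2, 55.7, 58.1, 50.3, 63.4],  # real-world
    "MSMT17":       [45.3, 40.1, 43.8, 36.2, 48.9],  # real-world
    "ClonedPerson": [70.4, 64.2, 67.5, 59.8, 73.1],  # synthetic
}
real_sets = {"Market-1501", "CUHK03", "MSMT17"}

# Step 1: Kendall rank correlation between the algorithm rankings on every
# pair of datasets, split into real-real and synthetic-real pairs.
real_real, synth_real = [], []
for a, b in combinations(scores, 2):
    tau, _ = kendalltau(scores[a], scores[b])
    if a in real_sets and b in real_sets:
        real_real.append(tau)
    else:
        synth_real.append(tau)

# Step 2: two-sample KS test of whether the two groups of correlations
# lie in identical distributions. A large p-value means we cannot reject
# that hypothesis, i.e. the synthetic dataset ranks algorithms the way a
# real-world dataset does.
stat, p_value = ks_2samp(real_real, synth_real)
print("real-real taus: ", np.round(real_real, 3))
print("synth-real taus:", np.round(synth_real, 3))
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

In this toy example each group contains only three correlations, so the KS test has little power; the point is the structure of the analysis, not the statistics of this particular run.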
Related papers
- Massively Annotated Datasets for Assessment of Synthetic and Real Data in Face Recognition [0.2775636978045794]
We study the performance drift between models trained on real and synthetic datasets.
We also study how real and synthetic datasets differ with respect to their attribute sets.
Interestingly, we verify that while real samples suffice to explain the synthetic distribution, the reverse does not hold.
arXiv Detail & Related papers (2024-04-23T17:10:49Z)
- Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine and (b) our proposed random-projection-based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which yields the best overall score.
Our proposed random-projection-based framework generates synthetic data with the highest accuracy score and scales the fastest.
arXiv Detail & Related papers (2023-12-09T02:04:25Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic [92.02220105679713]
We introduce the Alice benchmarks, large-scale datasets providing benchmarks and evaluation protocols to the research community.
Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID.
As an important feature of our real target, the clusterability of its training set is not manually guaranteed, making it closer to a real domain adaptation test scenario.
arXiv Detail & Related papers (2023-10-06T17:58:26Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Synthetic Benchmarks for Scientific Research in Explainable Machine Learning [14.172740234933215]
We release XAI-Bench: a suite of synthetic datasets and a library for benchmarking feature attribution algorithms.
Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values.
We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers.
arXiv Detail & Related papers (2021-06-23T17:10:21Z)
- On the use of automatically generated synthetic image datasets for benchmarking face recognition [2.0196229393131726]
Recent advances in Generative Adversarial Networks (GANs) to synthesize realistic face images provide a pathway to replace real datasets by synthetic datasets.
Benchmarking results on the synthetic dataset are a good substitute, often providing error rates and system rankings similar to benchmarking on the real dataset.
arXiv Detail & Related papers (2021-06-08T09:54:02Z)
- Benchmarking the Benchmark -- Analysis of Synthetic NIDS Datasets [4.125187280299247]
We analyse the statistical properties of benign traffic in three of the more recent and relevant NIDS datasets.
Our results show a distinct difference in most of the considered statistical features between the synthetic datasets and two real-world datasets.
arXiv Detail & Related papers (2021-04-19T03:17:37Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.