SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition
- URL: http://arxiv.org/abs/2409.08345v2
- Date: Tue, 17 Sep 2024 18:19:24 GMT
- Title: SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition
- Authors: Kassi Nzalasse, Rishav Raj, Eli Laird, Corey Clark,
- Abstract summary: We introduce the Synthetic Identity Generation pipeline, or SIG, that allows for the targeted creation of ethical, balanced datasets for face recognition evaluation.
Our pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age.
We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Artificial Intelligence applications expand, the evaluation of models faces heightened scrutiny. Ensuring public readiness requires evaluation datasets, which differ from training data by being disjoint and ethically sourced in compliance with privacy regulations. The performance and fairness of face recognition systems depend significantly on the quality and representativeness of these evaluation datasets. This data is sometimes scraped from the internet without user's consent, causing ethical concerns that can prohibit its use without proper releases. In rare cases, data is collected in a controlled environment with consent, however, this process is time-consuming, expensive, and logistically difficult to execute. This creates a barrier for those unable to conjure the immense resources required to gather ethically sourced evaluation datasets. To address these challenges, we introduce the Synthetic Identity Generation pipeline, or SIG, that allows for the targeted creation of ethical, balanced datasets for face recognition evaluation. Our proposed and demonstrated pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age. We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age, generated using the proposed SIG pipeline. We analyze ControlFace10k along with a non-synthetic BUPT dataset using state-of-the-art face recognition algorithms to demonstrate its effectiveness as an evaluation tool. This analysis highlights the dataset's characteristics and its utility in assessing algorithmic bias across different demographic groups.
Related papers
- ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition [60.15830516741776]
Synthetic face recognition (SFR) aims to generate datasets that mimic the distribution of real face data.
We introduce a diffusion-fueled SFR model termed $textID3$.
$textID3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances.
arXiv Detail & Related papers (2024-09-26T06:46:40Z) - Synthetic Counterfactual Faces [1.3062016289815055]
We build a generative AI framework to construct targeted, counterfactual, high-quality synthetic face data.
Our pipeline has many use cases, including face recognition systems sensitivity evaluations and image understanding system probes.
We showcase the efficacy of our face generation pipeline on a leading commercial vision model.
arXiv Detail & Related papers (2024-07-18T22:22:49Z) - Toward Fairer Face Recognition Datasets [69.04239222633795]
Face recognition and verification are computer vision tasks whose performance has progressed with the introduction of deep representations.
Ethical, legal, and technical challenges due to the sensitive character of face data and biases in real training datasets hinder their development.
We promote fairness by introducing a demographic attributes balancing mechanism in generated training datasets.
arXiv Detail & Related papers (2024-06-24T12:33:21Z) - SDFR: Synthetic Data for Face Recognition Competition [51.9134406629509]
Large-scale face recognition datasets are collected by crawling the Internet and without individuals' consent, raising legal, ethical, and privacy concerns.
Recently several works proposed generating synthetic face recognition datasets to mitigate concerns in web-crawled face recognition datasets.
This paper presents the summary of the Synthetic Data for Face Recognition (SDFR) Competition held in conjunction with the 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024)
The SDFR competition was split into two tasks, allowing participants to train face recognition systems using new synthetic datasets and/or existing ones.
arXiv Detail & Related papers (2024-04-06T10:30:31Z) - Leveraging Synthetic Data for Generalizable and Fair Facial Action Unit Detection [9.404202619102943]
We propose to use synthetically generated data and multi-source domain adaptation (MSDA) to address the problems of the scarcity of labeled data and the diversity of subjects.
Specifically, we propose to generate a diverse dataset through synthetic facial expression re-targeting.
To further improve gender fairness, PM2 matches the features of the real data with a female and a male synthetic image.
arXiv Detail & Related papers (2024-03-15T23:50:18Z) - My Face My Choice: Privacy Enhancing Deepfakes for Social Media
Anonymization [4.725675279167593]
We introduce three face access models in a hypothetical social network, where the user has the power to only appear in photos they approve.
Our approach eclipses current tagging systems and replaces unapproved faces with quantitatively dissimilar deepfakes.
Running seven SOTA face recognizers on our results, MFMC reduces the average accuracy by 61%.
arXiv Detail & Related papers (2022-11-02T17:58:20Z) - Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z) - AI-based Re-identification of Behavioral Clickstream Data [0.0]
This paper demonstrates that similar techniques can be applied to successfully re-identify individuals purely based on their behavioral patterns.
The mere resemblance of behavioral patterns between records is sufficient to correctly attribute behavioral data to identified individuals.
We also demonstrate how synthetic data can offer a viable alternative, that is shown to be resilient against our introduced AI-based re-identification attacks.
arXiv Detail & Related papers (2022-01-21T16:49:00Z) - RealGait: Gait Recognition for Person Re-Identification [79.67088297584762]
We construct a new gait dataset by extracting silhouettes from an existing video person re-identification challenge which consists of 1,404 persons walking in an unconstrained manner.
Our results suggest that recognizing people by their gait in real surveillance scenarios is feasible and the underlying gait pattern is probably the true reason why video person re-idenfification works in practice.
arXiv Detail & Related papers (2022-01-13T06:30:56Z) - On the use of automatically generated synthetic image datasets for
benchmarking face recognition [2.0196229393131726]
Recent advances in Generative Adversarial Networks (GANs) provide a pathway to replace real datasets by synthetic datasets.
Recent advances in Generative Adversarial Networks (GANs) to synthesize realistic face images provide a pathway to replace real datasets by synthetic datasets.
benchmarking results on the synthetic dataset are a good substitution, often providing error rates and system ranking similar to the benchmarking on the real dataset.
arXiv Detail & Related papers (2021-06-08T09:54:02Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.