Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation
- URL: http://arxiv.org/abs/2508.10672v2
- Date: Mon, 18 Aug 2025 09:15:35 GMT
- Title: Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation
- Authors: Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang
- Abstract summary: We present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales.
- Score: 87.48785461212556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
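The abstract describes two embedding-based steps: keeping the largest consistent cluster of images for each identity, and checking new identities against existing datasets for leakage. The following is a minimal sketch of how such steps might look, not the authors' implementation. The function names, the cosine-similarity threshold of 0.5, and the connected-components grouping are all assumptions made for illustration; the abstract does not specify the clustering algorithm or thresholds, and the GPT-4o-assisted verification step is omitted here.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def largest_consistent_cluster(embeddings, threshold=0.5):
    """Hypothetical helper: return indices of the largest group of images whose
    embeddings are linked by cosine similarity >= threshold (connected components).
    The threshold and grouping rule are illustrative assumptions."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return max(groups.values(), key=len) if groups else []


def leaks_identity(candidate, reference_embeddings, threshold=0.5):
    """Hypothetical helper: flag a new identity if its embedding is too close
    to any embedding from an existing dataset."""
    return any(cosine(candidate, ref) >= threshold for ref in reference_embeddings)


# Toy 2-D "embeddings" for one labeled identity; index 2 looks mislabeled.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
kept = largest_consistent_cluster(images)
print(kept)
print(leaks_identity([1.0, 0.0], [[0.0, 1.0]]))
```

In practice the embeddings would come from a face recognition model and have hundreds of dimensions; the pairwise loop is quadratic in the number of images, which is acceptable per identity but would need an approximate nearest-neighbor index for a dataset-wide leakage check.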
Related papers
- Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization [55.29071072675132]
Face anonymization aims to conceal identity information while preserving non-identity attributes. We propose ID$^2$Face, a training-centric anonymization framework. We show that ID$^2$Face outperforms existing methods in visual quality, identity suppression, and utility preservation.
arXiv Detail & Related papers (2025-10-28T09:28:12Z)
- From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes. To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z)
- Vec2Face+ for Face Dataset Generation [19.02273216268032]
Vec2Face+ is a generative model that creates images directly from image features. Our system generates VFace10K, a synthetic face dataset with 10K identities. The corresponding VFace100K and VFace300K datasets yield higher accuracy than the real-world training dataset, CASIA-WebFace.
arXiv Detail & Related papers (2025-07-23T04:34:56Z)
- FLUXSynID: A Framework for Identity-Controlled Synthetic Face Generation with Document and Live Images [0.0]
We introduce FLUXSynID, a framework for generating high-resolution synthetic face datasets. We generate synthetic faces with user-defined identity attribute distributions, offering both document-style and trusted live capture images. Our work is publicly released to support biometric research, including face recognition and morphing attack detection.
arXiv Detail & Related papers (2025-05-12T13:12:33Z)
- ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition [60.15830516741776]
Synthetic face recognition (SFR) aims to generate datasets that mimic the distribution of real face data.
We introduce a diffusion-fueled SFR model termed ID$^3$.
ID$^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances.
arXiv Detail & Related papers (2024-09-26T06:46:40Z)
- Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present a novel paradigm Diffusion-ReID to efficiently augment and generate diverse images based on known identities.
Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z)
- SFace: Privacy-friendly and Accurate Face Recognition using Synthetic Data [9.249824128880707]
We propose and investigate the feasibility of using a privacy-friendly synthetically generated face dataset to train face recognition models.
To address the privacy aspect of using such data to train a face recognition model, we provide extensive evaluation experiments on the identity relation between the synthetic dataset and the original authentic dataset used to train the generative model.
We also propose to train face recognition on our privacy-friendly dataset, SFace, using three different learning strategies, multi-class classification, label-free knowledge transfer, and combined learning of multi-class classification and knowledge transfer.
arXiv Detail & Related papers (2022-06-21T16:42:04Z)
- Camera-aware Proxies for Unsupervised Person Re-Identification [60.26031011794513]
This paper tackles the purely unsupervised person re-identification (Re-ID) problem that requires no annotations.
We propose to split each single cluster into multiple proxies and each proxy represents the instances coming from the same camera.
Based on the camera-aware proxies, we design both intra- and inter-camera contrastive learning components for our Re-ID model.
arXiv Detail & Related papers (2020-12-19T12:37:04Z)