Double Trouble? Impact and Detection of Duplicates in Face Image
Datasets
- URL: http://arxiv.org/abs/2401.14088v1
- Date: Thu, 25 Jan 2024 11:10:13 GMT
- Authors: Torsten Schlett, Christian Rathgeb, Juan Tapia, Christoph Busch
- Abstract summary: Various face image datasets intended for facial biometrics research were created via web-scraping.
This work presents an approach to detect both exactly and nearly identical face image duplicates.
- Score: 7.092869001331781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various face image datasets intended for facial biometrics research were
created via web-scraping, i.e. the collection of images publicly available on
the internet. This work presents an approach to detect both exactly and nearly
identical face image duplicates, using file and image hashes. The approach is
extended through the use of face image preprocessing. Additional steps based on
face recognition and face image quality assessment models reduce false
positives, and facilitate the deduplication of the face images both for intra-
and inter-subject duplicate sets. The presented approach is applied to five
datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a
cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset,
with hundreds to hundreds of thousands of duplicates for all except LFW. Face
recognition and quality assessment experiments indicate that removing the
duplicates has only a minor impact on the results. The final deduplication data is publicly
available.
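The two detection stages named in the abstract (file hashes for exact duplicates, image hashes for near duplicates) can be illustrated with a minimal sketch. The abstract does not state which hash functions the authors use, so the choices here are assumptions: SHA-256 stands in for the file hash and a simple average hash stands in for the image hash. The helper names, the `max_distance` threshold, and the representation of an image as a small grayscale grid (as it would look after resizing) are hypothetical, and the face preprocessing, face recognition, and quality assessment steps are not modeled.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def file_hash(data: bytes) -> str:
    """Exact-duplicate key: identical bytes give identical digests."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: list[list[int]]) -> int:
    """Near-duplicate key for a small grayscale grid.

    Each bit records whether a pixel is brighter than the grid mean,
    so uniform brightness shifts leave the hash unchanged.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def exact_duplicate_sets(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[file_hash(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def near_duplicate_pairs(
    images: dict[str, list[list[int]]], max_distance: int = 2
) -> list[tuple[str, str]]:
    """Return name pairs whose image hashes differ in at most max_distance bits."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(hashes), 2)
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]


# Toy data: two identical files and one different file.
files = {"a.jpg": b"abc", "b.jpg": b"abc", "c.jpg": b"xyz"}

# Toy 4x4 grids: a checkerboard, a slightly brighter copy, and its inverse.
base = [[10, 200, 10, 200], [200, 10, 200, 10]] * 2
brighter = [[p + 5 for p in row] for row in base]
inverted = [[210 - p for p in row] for row in base]
images = {"x": base, "y": brighter, "z": inverted}

print(exact_duplicate_sets(files))
print(near_duplicate_pairs(images))
```

The pairwise comparison is quadratic in the number of images, which is acceptable for a toy example but would need bucketing or indexing for datasets with hundreds of thousands of images.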
Related papers
- Semantic Contextualization of Face Forgery: A New Definition, Dataset, and Detection Method [77.65459419417533]
We put face forgery in a semantic context and define that computational methods that alter semantic face attributes are sources of face forgery.
We construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph.
We propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task.
arXiv Detail & Related papers (2024-05-14T10:24:19Z) - DiffusionFace: Towards a Comprehensive Dataset for Diffusion-Based Face Forgery Analysis [71.40724659748787]
DiffusionFace is the first diffusion-based face forgery dataset.
It covers various forgery categories, including unconditional and Text Guide facial image generation, Img2Img, Inpaint, and Diffusion-based facial exchange algorithms.
It provides essential metadata and a real-world internet-sourced forgery facial image dataset for evaluation.
arXiv Detail & Related papers (2024-03-27T11:32:44Z) - Arc2Face: A Foundation Model of Human Faces [95.00331107591859]
Arc2Face is an identity-conditioned face foundation model.
It can generate diverse photo-realistic images with a greater degree of face similarity than existing models.
arXiv Detail & Related papers (2024-03-18T10:32:51Z) - FACE-AUDITOR: Data Auditing in Facial Recognition Systems [24.082527732931677]
Few-shot-based facial recognition systems have gained increasing attention due to their scalability and ability to work with a few face images.
To prevent the face images from being misused, one straightforward approach is to modify the raw face images before sharing them.
We propose a complete toolkit FACE-AUDITOR that can query the few-shot-based facial recognition model and determine whether any of a user's face images is used in training the model.
arXiv Detail & Related papers (2023-04-05T23:03:54Z) - FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders [81.21440457805932]
We propose a novel framework FaceMAE, where the face privacy and recognition performance are considered simultaneously.
Randomly masked face images are used to train the reconstruction module in FaceMAE.
We also perform sufficient privacy-preserving face recognition on several public face datasets.
arXiv Detail & Related papers (2022-05-23T07:19:42Z) - Reliable Detection of Doppelgängers based on Deep Face Representations [14.832145647643848]
We assess the impact of doppelgängers on the HDA Doppelgänger and Disguised Faces in The Wild databases.
It is found that doppelgänger image pairs yield very high similarity scores resulting in a significant increase of false match rates.
We propose a doppelgänger detection method which distinguishes doppelgängers from mated comparison trials.
arXiv Detail & Related papers (2022-01-21T18:37:08Z) - FaceOcc: A Diverse, High-quality Face Occlusion Dataset for Human Face Extraction [3.8502825594372703]
Occlusions often occur in face images in the wild, troubling face-related tasks such as landmark detection, 3D reconstruction, and face recognition.
This paper proposes a novel face segmentation dataset with manually labeled face occlusions from the CelebA-HQ and the internet.
We trained a straightforward face segmentation model but obtained SOTA performance, convincingly demonstrating the effectiveness of the proposed dataset.
arXiv Detail & Related papers (2022-01-20T19:44:18Z) - End2End Occluded Face Recognition by Masking Corrupted Features [82.27588990277192]
State-of-the-art general face recognition models do not generalize well to occluded face images.
This paper presents a novel face recognition method that is robust to occlusions based on a single end-to-end deep neural network.
Our approach, named FROM (Face Recognition with Occlusion Masks), learns to discover the corrupted features from the deep convolutional neural networks, and clean them by the dynamically learned masks.
arXiv Detail & Related papers (2021-08-21T09:08:41Z) - SynFace: Face Recognition with Synthetic Data [83.15838126703719]
We devise the SynFace with identity mixup (IM) and domain mixup (DM) to mitigate the performance gap.
We also perform a systematic empirical analysis on synthetic face images to provide some insights on how to effectively utilize synthetic data for face recognition.
arXiv Detail & Related papers (2021-08-18T03:41:54Z) - When Face Recognition Meets Occlusion: A New Benchmark [37.616211206620854]
We create a simulated occlusion face recognition dataset.
It covers 804,704 face images of 10,575 subjects.
Our dataset significantly outperforms the state of the art.
arXiv Detail & Related papers (2021-03-04T03:07:42Z) - A Method for Curation of Web-Scraped Face Image Datasets [13.893682217746816]
A variety of issues occur when collecting a dataset in-the-wild.
With the number of images being in the millions, a manual cleaning procedure is not feasible.
We propose a semi-automated method, where the goal is to have a clean dataset for testing face recognition methods.
arXiv Detail & Related papers (2020-04-07T01:57:32Z)