A Method for Curation of Web-Scraped Face Image Datasets
- URL: http://arxiv.org/abs/2004.03074v1
- Date: Tue, 7 Apr 2020 01:57:32 GMT
- Title: A Method for Curation of Web-Scraped Face Image Datasets
- Authors: Kai Zhang, Vítor Albiero and Kevin W. Bowyer
- Abstract summary: A variety of issues occur when collecting a dataset in-the-wild.
With the number of images being in the millions, a manual cleaning procedure is not feasible.
We propose a semi-automated method, where the goal is to have a clean dataset for testing face recognition methods.
- Score: 13.893682217746816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web-scraped, in-the-wild datasets have become the norm in face recognition
research. The numbers of subjects and images acquired in web-scraped datasets
are usually very large, with the number of images on the scale of millions. A variety
of issues occur when collecting a dataset in-the-wild, including images with
the wrong identity label, duplicate images, duplicate subjects and variation in
quality. With the number of images being in the millions, a manual cleaning
procedure is not feasible. But fully automated methods used to date leave the
resulting datasets less than ideally clean. We propose a semi-automated method,
where the goal is to have a clean dataset for testing face recognition methods,
with similar quality across men and women, to support comparison of accuracy
across gender. Our approach removes near-duplicate images, merges duplicate
subjects, corrects mislabeled images, and removes images outside a defined
range of pose and quality. We conduct the curation on the Asian Face Dataset
(AFD) and VGGFace2 test dataset. The experiments show that a state-of-the-art
method achieves a much higher accuracy on the datasets after they are curated.
Finally, we release our cleaned versions of both datasets to the research
community.
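The curation steps in the abstract (removing near-duplicates, merging subjects, correcting labels) all hinge on comparing face embeddings and routing borderline cases to a human, which is what makes the method semi-automated rather than fully automatic. The sketch below illustrates that idea for the near-duplicate step only; the embedding vectors, thresholds, and function names are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of semi-automated near-duplicate triage: pairs above a high
# cosine-similarity threshold are flagged for automatic removal, pairs in an
# intermediate band are queued for manual review, and the rest are kept.
# Thresholds (0.95 / 0.80) are hypothetical, chosen for illustration only.
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def triage_pairs(embeddings, auto_thresh=0.95, review_thresh=0.80):
    """Split all image pairs into (auto_remove, manual_review) lists of
    index pairs based on embedding similarity."""
    auto_remove, manual_review = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        s = cosine(embeddings[i], embeddings[j])
        if s >= auto_thresh:
            auto_remove.append((i, j))      # near-certain duplicate
        elif s >= review_thresh:
            manual_review.append((i, j))    # borderline: send to a human
    return auto_remove, manual_review
```

In a real pipeline the embeddings would come from a face recognition network rather than toy vectors, and the review band is exactly where the manual effort of the semi-automated approach is spent.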
Related papers
- Semantic Contextualization of Face Forgery: A New Definition, Dataset, and Detection Method [77.65459419417533]
We put face forgery in a semantic context and define that computational methods that alter semantic face attributes are sources of face forgery.
We construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph.
We propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task.
arXiv Detail & Related papers (2024-05-14T10:24:19Z)
- Double Trouble? Impact and Detection of Duplicates in Face Image Datasets [7.092869001331781]
Face image datasets intended for facial biometrics research were created via web-scraping.
This work presents an approach to detect both exactly and nearly identical face image duplicates.
arXiv Detail & Related papers (2024-01-25T11:10:13Z)
- Multi-Task Faces (MTF) Data Set: A Legally and Ethically Compliant Collection of Face Images for Various Classification Tasks [3.1133049660590615]
Recent privacy regulations have restricted the ways in which human images may be collected and used for research.
Several previously published data sets containing human faces have been removed from the internet due to inadequate data collection methods.
We present the Multi-Task Faces (MTF) image data set, a meticulously curated collection of face images designed for various classification tasks.
arXiv Detail & Related papers (2023-11-20T16:19:46Z)
- Diverse, Difficult, and Odd Instances (D2O): A New Test Set for Object Classification [47.64219291655723]
We introduce a new test set, called D2O, which is sufficiently different from existing test sets.
Our dataset contains 8,060 images spread across 36 categories, out of which 29 appear in ImageNet.
The best Top-1 accuracy on our dataset is around 60%, much lower than the 91% best Top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2023-01-29T19:58:32Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired setting.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: object-agnostic pre-processing, manual image selection, and CNN-based image selection.
arXiv Detail & Related papers (2022-10-18T12:49:04Z)
- Personalized Image Semantic Segmentation [58.980245748434]
We generate more accurate segmentation results on unlabeled personalized images by investigating the data's personalized traits.
We propose a baseline method that incorporates the inter-image context when segmenting certain images.
The code and the PIS dataset will be made publicly available.
arXiv Detail & Related papers (2021-07-24T04:03:11Z)
- Machine learning with limited data [1.2183405753834562]
We study few-shot image classification, in which only very little labeled data is available.
One method is to augment image features by mixing the style of these images.
The second method is applying spatial attention to explore the relations between patches of images.
arXiv Detail & Related papers (2021-01-18T17:10:39Z)
- Semi-supervised Learning for Few-shot Image-to-Image Translation [89.48165936436183]
We propose a semi-supervised method for few-shot image translation, called SEMIT.
Our method achieves excellent results on four different datasets using as little as 10% of the source labels.
arXiv Detail & Related papers (2020-03-30T22:46:49Z)
- Dataset Cleaning -- A Cross Validation Methodology for Large Facial Datasets using Face Recognition [0.40611352512781856]
In recent years, large "in the wild" face datasets have been released in an attempt to facilitate progress in tasks such as face detection, face recognition, and other tasks.
Due to the automatic way these datasets are gathered and due to their large size, many identity folders contain mislabeled samples, which deteriorates the quality of the datasets.
In this work, we present a semi-automatic method for cleaning noisy large face datasets with the use of face recognition.
arXiv Detail & Related papers (2020-03-24T13:01:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.