Efficient Deduplication and Leakage Detection in Large Scale Image
Datasets with a focus on the CrowdAI Mapping Challenge Dataset
- URL: http://arxiv.org/abs/2304.02296v1
- Date: Wed, 5 Apr 2023 08:36:17 GMT
- Title: Efficient Deduplication and Leakage Detection in Large Scale Image
Datasets with a focus on the CrowdAI Mapping Challenge Dataset
- Authors: Yeshwanth Kumar Adimoolam, Bodhiswatta Chatterjee, Charalambos
Poullis, Melinos Averkiou
- Abstract summary: We propose a drop-in pipeline that employs perceptual hashing techniques for efficient de-duplication of the dataset.
In our experiments, we demonstrate that nearly 250k ($\sim$90%) images in the training split were identical.
Our analysis of the validation split demonstrates that roughly 56k of the 60k images also appear in the training split, resulting in a data leakage of 93%.
- Score: 5.149242555705579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in deep learning and computer vision have led to
widespread use of deep neural networks to extract building footprints from
remote-sensing imagery. The success of such methods relies on the availability
of large databases of high-resolution remote sensing images with high-quality
annotations. The CrowdAI Mapping Challenge Dataset is one of these datasets
that has been used extensively in recent years to train deep neural networks.
This dataset consists of $\sim$280k training images and $\sim$60k testing
images, with polygonal building annotations for all images. However, issues
such as low-quality and incorrect annotations, extensive duplication of image
samples, and data leakage significantly reduce the utility of deep neural
networks trained on the dataset. Therefore, it is an imperative pre-condition
to adopt a data validation pipeline that evaluates the quality of the dataset
prior to its use. To this end, we propose a drop-in pipeline that employs
perceptual hashing techniques for efficient de-duplication of the dataset and
identification of instances of data leakage between training and testing
splits. In our experiments, we demonstrate that nearly 250k ($\sim$90%)
images in the training split were identical. Moreover, our analysis of the
validation split demonstrates that roughly 56k of the 60k images also appear in
the training split, resulting in a data leakage of 93%. The source code used
for the analysis and de-duplication of the CrowdAI Mapping Challenge dataset is
publicly available at https://github.com/yeshwanth95/CrowdAI_Hash_and_search .
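To make the approach concrete, below is a minimal Python sketch of a perceptual-hashing de-duplication and leakage check built on the open-source ImageHash library. The directory layout, file glob, and exact-hash-match criterion are assumptions for illustration; the authors' actual pipeline is in the repository linked above, and a production version might also compare hashes under a small Hamming-distance threshold rather than requiring exact equality.
```python
# Minimal sketch (not the authors' code): de-duplicate a training split and
# detect train/validation leakage via exact perceptual-hash matches.
from collections import defaultdict
from pathlib import Path

import imagehash  # pip install ImageHash
from PIL import Image

def hash_split(image_dir):
    """Map each perceptual hash (as a hex string) to the paths producing it."""
    buckets = defaultdict(list)
    for path in sorted(Path(image_dir).glob("*.jpg")):  # assumed layout
        buckets[str(imagehash.phash(Image.open(path)))].append(path)
    return buckets

train = hash_split("train/images")  # hypothetical directory names
val = hash_split("val/images")

# Duplicates: hash buckets that contain more than one training image.
duplicates = {h: ps for h, ps in train.items() if len(ps) > 1}
# Leakage: validation hashes that already occur in the training split.
leaked = {h: ps for h, ps in val.items() if h in train}

print(f"{len(duplicates)} duplicate groups in train; "
      f"{sum(len(ps) for ps in leaked.values())} leaked validation images")
```
Because each image is reduced to a short hash, both checks run in roughly linear time over the dataset, which is what makes the pipeline practical at the scale of $\sim$280k images.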
Related papers
- DataDAM: Efficient Dataset Distillation with Attention Matching [15.300968899043498]
Researchers have long tried to minimize training costs in deep learning while maintaining strong generalization across diverse datasets.
Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset.
However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks [83.08625720856445]
Deep learning tasks often require annotations that are too time-consuming for human operators to produce.
In this paper, we introduce PromptMix, a method for artificially boosting the size of existing datasets.
We show that PromptMix can significantly increase the performance of lightweight networks by up to 26%.
arXiv Detail & Related papers (2023-01-30T14:15:47Z)
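As a loose illustration of the idea in the entry above (not the authors' implementation), one could enlarge a dataset with an off-the-shelf text-to-image diffusion model; the model checkpoint, prompts, and output directory below are placeholders.
```python
# Hypothetical sketch of diffusion-based dataset enlargement in the spirit
# of PromptMix; the model checkpoint and prompts are illustrative only.
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline  # pip install diffusers

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

Path("synthetic").mkdir(exist_ok=True)
prompts = ["aerial view of suburban houses", "satellite image of a city block"]
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]        # one synthetic training sample
    image.save(f"synthetic/{i:05d}.png")  # add it to the real dataset
```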
- Supervised and Contrastive Self-Supervised In-Domain Representation Learning for Dense Prediction Problems in Remote Sensing [0.0]
This paper explores the effectiveness of in-domain representations, in both supervised and self-supervised forms, in bridging the domain gap between remote sensing imagery and the ImageNet dataset.
For self-supervised pre-training, we have utilized the SimSiam algorithm as it is simple and does not need huge computational resources.
Our results have demonstrated that using datasets with a high spatial resolution for self-supervised representation learning leads to high performance in downstream tasks.
arXiv Detail & Related papers (2023-01-29T20:56:51Z)
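For reference, the SimSiam objective used in the entry above reduces to a symmetric negative cosine similarity with a stop-gradient on the target branch; the following is a minimal PyTorch sketch, with tensor names chosen for illustration.
```python
# Minimal sketch of the SimSiam loss (Chen & He, 2021). p1/p2 are predictor
# outputs and z1/z2 encoder outputs for two augmented views of the same image.
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # detach() implements the stop-gradient that prevents representational
    # collapse; the loss is symmetrized over the two views.
    return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
```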
- Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image.
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
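The copy-paste pair generation described in the entry above can be illustrated in a few lines of NumPy; the mask and sizes here are arbitrary stand-ins, not the authors' procedure.
```python
# Illustrative copy-paste pair generation: cut a masked segment from one
# image and paste it into another, yielding a pair that shares a region.
import numpy as np

def paste_segment(src, dst, mask, top, left):
    """Overwrite dst at (top, left) with the True-masked pixels of src."""
    out = dst.copy()
    h, w = mask.shape
    out[top:top + h, left:left + w][mask] = src[:h, :w][mask]
    return out

rng = np.random.default_rng(0)
src = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
dst = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True                      # hypothetical object segment
pair = (src, paste_segment(src, dst, mask, top=10, left=20))
```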
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Data Augmentation for Object Detection via Differentiable Neural Rendering [71.00447761415388]
It is challenging to train a robust object detector when annotated data is scarce.
Existing approaches to tackle this problem include semi-supervised learning that interpolates labeled data from unlabeled data.
We introduce an offline data augmentation method for object detection, which semantically interpolates the training data with novel views.
arXiv Detail & Related papers (2021-03-04T06:31:06Z)
- Single Image Cloud Detection via Multi-Image Fusion [23.641624507709274]
A primary challenge in developing cloud detection algorithms is the cost of collecting annotated training data.
We demonstrate how recent advances in multi-image fusion can be leveraged to bootstrap single image cloud detection.
We collect a large dataset of Sentinel-2 images along with a per-pixel semantic labelling for land cover.
arXiv Detail & Related papers (2020-07-29T22:52:28Z)
- Complex Wavelet SSIM based Image Data Augmentation [0.0]
We look at the MNIST handwritten digit dataset, an image dataset used for digit recognition.
We take a detailed look at one of the most popular augmentation techniques used for this dataset: elastic deformation.
We propose to use a similarity measure called Complex Wavelet Structural Similarity Index Measure (CWSSIM) to selectively filter out the irrelevant data.
arXiv Detail & Related papers (2020-07-11T21:11:46Z)
- From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)
- Data Consistent CT Reconstruction from Insufficient Data with Learned Prior Images [70.13735569016752]
We investigate the robustness of deep learning in CT image reconstruction by showing false negative and false positive lesion cases.
We propose a data consistent reconstruction (DCR) method to improve their image quality, which combines the advantages of compressed sensing and deep learning.
The efficacy of the proposed method is demonstrated in cone-beam CT with truncated data, limited-angle data and sparse-view data, respectively.
arXiv Detail & Related papers (2020-05-20T13:30:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.