Efficient Deduplication and Leakage Detection in Large Scale Image
Datasets with a focus on the CrowdAI Mapping Challenge Dataset
- URL: http://arxiv.org/abs/2304.02296v1
- Date: Wed, 5 Apr 2023 08:36:17 GMT
- Title: Efficient Deduplication and Leakage Detection in Large Scale Image
Datasets with a focus on the CrowdAI Mapping Challenge Dataset
- Authors: Yeshwanth Kumar Adimoolam, Bodhiswatta Chatterjee, Charalambos
Poullis, Melinos Averkiou
- Abstract summary: We propose a drop-in pipeline that employs perceptual hashing techniques for efficient de-duplication of the dataset.
In our experiments, we demonstrate that nearly 250k ($\sim$90%) images in the training split were identical.
Our analysis of the validation split demonstrates that roughly 56k of the 60k images also appear in the training split, resulting in a data leakage of 93%.
- Score: 5.149242555705579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in deep learning and computer vision have led to
widespread use of deep neural networks to extract building footprints from
remote-sensing imagery. The success of such methods relies on the availability
of large databases of high-resolution remote sensing images with high-quality
annotations. The CrowdAI Mapping Challenge Dataset is one of these datasets
that has been used extensively in recent years to train deep neural networks.
This dataset consists of $\sim$280k training images and $\sim$60k testing
images, with polygonal building annotations for all images. However, issues
such as low-quality and incorrect annotations, extensive duplication of image
samples, and data leakage significantly reduce the utility of deep neural
networks trained on the dataset. Therefore, it is an imperative pre-condition
to adopt a data validation pipeline that evaluates the quality of the dataset
prior to its use. To this end, we propose a drop-in pipeline that employs
perceptual hashing techniques for efficient de-duplication of the dataset and
identification of instances of data leakage between training and testing
splits. In our experiments, we demonstrate that nearly 250k ($\sim$90%)
images in the training split were identical. Moreover, our analysis of the
validation split demonstrates that roughly 56k of the 60k images also appear in
the training split, resulting in a data leakage of 93%. The source code used
for the analysis and de-duplication of the CrowdAI Mapping Challenge dataset is
publicly available at https://github.com/yeshwanth95/CrowdAI_Hash_and_search .
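To make the approach concrete, below is a minimal Python sketch of a perceptual-hashing de-duplication and leakage check built on the open-source ImageHash library. The directory layout, file glob, and exact-hash-match criterion are assumptions for illustration; the authors' actual pipeline is in the repository linked above, and a production version might also compare hashes under a small Hamming-distance threshold rather than requiring exact equality.
```python
# Minimal sketch (not the authors' code): de-duplicate a training split and
# detect train/validation leakage via exact perceptual-hash matches.
from collections import defaultdict
from pathlib import Path

import imagehash  # pip install ImageHash
from PIL import Image

def hash_split(image_dir):
    """Map each perceptual hash (as a hex string) to the paths producing it."""
    buckets = defaultdict(list)
    for path in sorted(Path(image_dir).glob("*.jpg")):  # assumed layout
        buckets[str(imagehash.phash(Image.open(path)))].append(path)
    return buckets

train = hash_split("train/images")  # hypothetical directory names
val = hash_split("val/images")

# Duplicates: hash buckets that contain more than one training image.
duplicates = {h: ps for h, ps in train.items() if len(ps) > 1}
# Leakage: validation hashes that already occur in the training split.
leaked = {h: ps for h, ps in val.items() if h in train}

print(f"{len(duplicates)} duplicate groups in train; "
      f"{sum(len(ps) for ps in leaked.values())} leaked validation images")
```
Because each image is reduced to a short hash, both checks run in roughly linear time over the dataset, which is what makes the pipeline practical at the scale of $\sim$280k images.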
Related papers
- DataDAM: Efficient Dataset Distillation with Attention Matching [15.300968899043498]
Researchers have long tried to minimize training costs in deep learning while maintaining strong generalization across diverse datasets.
Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset.
However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks [83.08625720856445]
Deep learning tasks often require annotations that are too time-consuming for human operators to produce.
In this paper, we introduce PromptMix, a method for artificially boosting the size of existing datasets.
We show that PromptMix can significantly increase the performance of lightweight networks by up to 26%.
arXiv Detail & Related papers (2023-01-30T14:15:47Z)
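As a loose illustration of the idea in the entry above (not the authors' implementation), one could enlarge a dataset with an off-the-shelf text-to-image diffusion model; the model checkpoint, prompts, and output directory below are placeholders.
```python
# Hypothetical sketch of diffusion-based dataset enlargement in the spirit
# of PromptMix; the model checkpoint and prompts are illustrative only.
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline  # pip install diffusers

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

Path("synthetic").mkdir(exist_ok=True)
prompts = ["aerial view of suburban houses", "satellite image of a city block"]
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]        # one synthetic training sample
    image.save(f"synthetic/{i:05d}.png")  # add it to the real dataset
```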
- Supervised and Contrastive Self-Supervised In-Domain Representation Learning for Dense Prediction Problems in Remote Sensing [0.0]
This paper explores the effectiveness of in-domain representations, in both supervised and self-supervised forms, in bridging the domain gap between remote sensing imagery and the ImageNet dataset.
For self-supervised pre-training, we have utilized the SimSiam algorithm as it is simple and does not need huge computational resources.
Our results have demonstrated that using datasets with a high spatial resolution for self-supervised representation learning leads to high performance in downstream tasks.
arXiv Detail & Related papers (2023-01-29T20:56:51Z)
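For reference, the SimSiam objective used in the entry above reduces to a symmetric negative cosine similarity with a stop-gradient on the target branch; the following is a minimal PyTorch sketch, with tensor names chosen for illustration.
```python
# Minimal sketch of the SimSiam loss (Chen & He, 2021). p1/p2 are predictor
# outputs and z1/z2 encoder outputs for two augmented views of the same image.
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # detach() implements the stop-gradient that prevents representational
    # collapse; the loss is symmetrized over the two views.
    return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
```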
- Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image.
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
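The copy-paste pair generation described in the entry above can be illustrated in a few lines of NumPy; the mask and sizes here are arbitrary stand-ins, not the authors' procedure.
```python
# Illustrative copy-paste pair generation: cut a masked segment from one
# image and paste it into another, yielding a pair that shares a region.
import numpy as np

def paste_segment(src, dst, mask, top, left):
    """Overwrite dst at (top, left) with the True-masked pixels of src."""
    out = dst.copy()
    h, w = mask.shape
    out[top:top + h, left:left + w][mask] = src[:h, :w][mask]
    return out

rng = np.random.default_rng(0)
src = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
dst = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True                      # hypothetical object segment
pair = (src, paste_segment(src, dst, mask, top=10, left=20))
```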
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Data Augmentation for Object Detection via Differentiable Neural Rendering [71.00447761415388]
It is challenging to train a robust object detector when annotated data is scarce.
Existing approaches to tackle this problem include semi-supervised learning that interpolates labeled data from unlabeled data.
We introduce an offline data augmentation method for object detection, which semantically interpolates the training data with novel views.
arXiv Detail & Related papers (2021-03-04T06:31:06Z)
- Single Image Cloud Detection via Multi-Image Fusion [23.641624507709274]
A primary challenge in developing cloud detection algorithms is the cost of collecting annotated training data.
We demonstrate how recent advances in multi-image fusion can be leveraged to bootstrap single image cloud detection.
We collect a large dataset of Sentinel-2 images along with a per-pixel semantic labelling for land cover.
arXiv Detail & Related papers (2020-07-29T22:52:28Z)
- Complex Wavelet SSIM based Image Data Augmentation [0.0]
We look at the MNIST handwritten digit dataset, an image dataset used for digit recognition.
We take a detailed look at one of the most popular augmentation techniques used for this dataset: elastic deformation.
We propose to use a similarity measure called Complex Wavelet Structural Similarity Index Measure (CWSSIM) to selectively filter out the irrelevant data.
arXiv Detail & Related papers (2020-07-11T21:11:46Z)
- From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)
- Data Consistent CT Reconstruction from Insufficient Data with Learned Prior Images [70.13735569016752]
We investigate the robustness of deep learning in CT image reconstruction by showing false negative and false positive lesion cases.
We propose a data consistent reconstruction (DCR) method to improve their image quality, which combines the advantages of compressed sensing and deep learning.
The efficacy of the proposed method is demonstrated in cone-beam CT with truncated data, limited-angle data and sparse-view data, respectively.
arXiv Detail & Related papers (2020-05-20T13:30:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.