Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability
- URL: http://arxiv.org/abs/2504.14446v1
- Date: Sun, 20 Apr 2025 01:36:07 GMT
- Title: Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability
- Authors: Carlos Caetano, Gabriel O. dos Santos, Caio Petrucci, Artur Barros, Camila Laranjeira, Leo S. F. Ribeiro, Júlia F. de Mendonça, Jefersson A. dos Santos, Sandra Avila,
- Abstract summary: Including children's images in datasets has raised ethical concerns. These datasets can expose children to risks such as exploitation, profiling, and tracking. We propose a pipeline to detect and remove such images.
- Score: 6.366871989491978
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Including children's images in datasets has raised ethical concerns, particularly regarding privacy, consent, data protection, and accountability. These datasets, often built by scraping publicly available images from the Internet, can expose children to risks such as exploitation, profiling, and tracking. Despite the growing recognition of these issues, approaches for addressing them remain limited. We explore the ethical implications of using children's images in AI datasets and propose a pipeline to detect and remove such images. As a use case, we built the pipeline on a Vision-Language Model under the Visual Question Answering task and tested it on the #PraCegoVer dataset. We also evaluated the pipeline on a subset of 100,000 images from the Open Images V7 dataset to assess its effectiveness in detecting and removing images of children. The pipeline serves as a baseline for future research, providing a starting point for more comprehensive tools and methodologies. While we leverage existing models trained on potentially problematic data, our goal is to expose and address this issue. We do not advocate for training or deploying such models, but instead call for urgent community reflection and action to protect children's rights. Ultimately, we aim to encourage the research community to exercise more than just additional care in creating new datasets and to inspire the development of tools to protect the fundamental rights of vulnerable groups, particularly children.
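The abstract describes the pipeline only at a high level: a Vision-Language Model is queried under the Visual Question Answering task to decide whether an image depicts a child. A minimal sketch of that idea follows, using the off-the-shelf BLIP VQA model from Hugging Face Transformers as a stand-in; the specific model, question wording, and yes/no decision rule are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a VQA-based filter for flagging images of children.
# Assumptions: BLIP VQA as a stand-in VLM; the question wording and the
# yes/no decision rule are illustrative, not the authors' exact setup.
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

QUESTION = "Is there a child in this image?"  # hypothetical prompt

def contains_child(image_path: str) -> bool:
    """Ask the VLM a yes/no question and flag the image if it answers 'yes'."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, QUESTION, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=5)
    answer = processor.decode(output_ids[0], skip_special_tokens=True)
    return answer.strip().lower().startswith("yes")

def filter_dataset(image_dir: str) -> list[str]:
    """Keep only the paths for which no child was detected."""
    return [str(p) for p in sorted(Path(image_dir).glob("*.jpg"))
            if not contains_child(str(p))]
```

A single yes/no query is a deliberately simple decision rule, consistent with the abstract's framing of the pipeline as a baseline for more comprehensive tools rather than a production filter.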
Related papers
- Person Re-Identification without Identification via Event Anonymization [23.062038973576296]
Deep learning has been able to reconstruct images from event cameras with high fidelity, reintroducing a potential threat to privacy for event-based vision applications.
We propose an end-to-end network architecture jointly optimized for the twofold objective of preserving privacy and performing a downstream task such as person ReID.
arXiv Detail & Related papers (2023-08-08T17:04:53Z) - ConfounderGAN: Protecting Image Data Privacy with Causal Confounder [85.6757153033139]
We propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners.
Experiments are conducted in six image classification datasets, consisting of three natural object datasets and three medical datasets.
arXiv Detail & Related papers (2022-12-04T08:49:14Z) - Hiding Visual Information via Obfuscating Adversarial Perturbations [47.315523613407244]
We propose an adversarial visual information hiding method to protect the visual privacy of data.
Specifically, the method generates obfuscating adversarial perturbations to obscure the visual information of the data.
Experimental results on the recognition and classification tasks demonstrate that the proposed method can effectively hide visual information.
arXiv Detail & Related papers (2022-09-30T08:23:26Z) - Learning to See by Looking at Noise [87.12788334473295]
- Learning to See by Looking at Noise [87.12788334473295]
We investigate a suite of image generation models that produce images from simple random processes.
These are then used as training data for a visual representation learner with a contrastive loss.
Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic.
arXiv Detail & Related papers (2021-06-10T17:56:46Z) - Curious Representation Learning for Embodied Intelligence [81.21764276106924]
- Curious Representation Learning for Embodied Intelligence [81.21764276106924]
Self-supervised representation learning has achieved remarkable success in recent years.
Yet to build truly intelligent agents, we must construct representation learning algorithms that can learn from environments.
We propose a framework, curious representation learning, which jointly learns a reinforcement learning policy and a visual representation model.
arXiv Detail & Related papers (2021-05-03T17:59:20Z) - Data Augmentation for Object Detection via Differentiable Neural
Rendering [71.00447761415388]
It is challenging to train a robust object detector when annotated data is scarce.
Existing approaches to tackle this problem include semi-supervised learning that interpolates labeled data from unlabeled data.
We introduce an offline data augmentation method for object detection, which semantically interpolates the training data with novel views.
arXiv Detail & Related papers (2021-03-04T06:31:06Z) - Deep Learning Benchmarks and Datasets for Social Media Image
Classification for Disaster Response [5.610924570214424]
We propose new datasets for disaster type detection, informativeness classification, and damage severity assessment.
We benchmark several state-of-the-art deep learning models and achieve promising results.
We release our datasets and models publicly, aiming to provide proper baselines as well as to spur further research in the crisis informatics community.
arXiv Detail & Related papers (2020-11-17T20:15:49Z) - Improving Object Detection with Selective Self-supervised Self-training [62.792445237541145]
We study how to leverage Web images to augment human-curated object detection datasets.
We retrieve Web images by image-to-image search, which incurs less domain shift from the curated data than other search methods.
We propose a novel learning method motivated by two parallel lines of work that explore unlabeled data for image classification.
arXiv Detail & Related papers (2020-07-17T18:05:01Z) - Large image datasets: A pyrrhic win for computer vision? [2.627046865670577]
We investigate problematic practices and consequences of large scale vision datasets.
We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets.
arXiv Detail & Related papers (2020-06-24T06:41:32Z) - From ImageNet to Image Classification: Contextualizing Progress on
Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z) - Privacy-Preserving Image Classification in the Local Setting [17.375582978294105]
Local Differential Privacy (LDP) offers a promising solution: it allows data owners to randomly perturb their input, providing plausible deniability before the data are released.
In this paper, we consider a two-party image classification problem, in which data owners hold the images and an untrustworthy data user would like to fit a machine learning model with these images as input.
We propose a supervised image feature extractor, DCAConv, which produces an image representation with scalable domain size.
arXiv Detail & Related papers (2020-02-09T01:25:52Z)