Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets
- URL: http://arxiv.org/abs/2204.14110v1
- Date: Fri, 29 Apr 2022 14:02:42 GMT
- Title: Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets
- Authors: Camila Laranjeira, João Macedo, Sandra Avila, Jefersson A. dos Santos
- Abstract summary: We propose an analysis template that goes beyond the statistics of the dataset and respective labels.
It focuses on the extraction of automatic signals, provided both by pre-trained machine learning models and by image metrics such as luminance and sharpness.
Our goal is to safely publicize the characteristics of CSAM datasets, encouraging researchers to join the field.
- Score: 9.016916087221801
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The online sharing and viewing of Child Sexual Abuse Material (CSAM) are
growing fast, such that human experts can no longer handle the manual
inspection. However, the automatic classification of CSAM is a challenging
field of research, largely due to the inaccessibility of target data that is -
and should forever be - private and in sole possession of law enforcement
agencies. To aid researchers in drawing insights from unseen data and safely
providing further understanding of CSAM images, we propose an analysis template
that goes beyond the statistics of the dataset and respective labels. It
focuses on the extraction of automatic signals, provided both by pre-trained
machine learning models, e.g., object categories and pornography detection, as
well as image metrics such as luminance and sharpness. Only aggregated
statistics of sparse signals are provided to guarantee the anonymity of
victimized children and adolescents. The pipeline allows filtering the data by
applying thresholds to each specified signal and provides the distribution of
such signals within the subset, correlations between signals, as well as a bias
evaluation. We demonstrated our proposal on the Region-based annotated Child
Pornography Dataset (RCPD), one of the few CSAM benchmarks in the literature,
composed of over 2000 samples spanning regular and CSAM images, produced in
partnership with Brazil's Federal Police. Although noisy and limited in several
senses, we argue that automatic signals can highlight important aspects of the
overall distribution of data, which is valuable for databases that cannot be
disclosed. Our goal is to safely publicize the characteristics of CSAM
datasets, encouraging researchers to join the field and perhaps other
institutions to provide similar reports on their benchmarks.
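The filtering and aggregation steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Rec. 601 luma weights, the Laplacian-variance sharpness measure, the `detector_score` signal, and the random-noise images are all hypothetical stand-ins chosen for the example.

```python
import numpy as np


def mean_luminance(rgb):
    """Mean luma of an HxWx3 image in [0, 1] (Rec. 601 weights, an assumption)."""
    weights = np.array([0.299, 0.587, 0.114])
    return float((rgb @ weights).mean())


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; one common sharpness proxy (assumed)."""
    lap = (
        gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
        - 4.0 * gray[1:-1, 1:-1]
    )
    return float(lap.var())


def filter_by_thresholds(signals, thresholds):
    """Boolean mask of samples whose signals all lie inside their [low, high] range."""
    n = len(next(iter(signals.values())))
    mask = np.ones(n, dtype=bool)
    for name, (low, high) in thresholds.items():
        mask &= (signals[name] >= low) & (signals[name] <= high)
    return mask


def aggregate_report(signals, mask, bins=5):
    """Return only aggregates for the subset: histograms and a correlation matrix."""
    names = sorted(signals)
    subset = np.stack([signals[name][mask] for name in names])
    histograms = {
        name: np.histogram(signals[name][mask], bins=bins)[0] for name in names
    }
    correlations = np.corrcoef(subset)  # rows and columns follow `names`
    return names, histograms, correlations


# Random noise stands in for the images; no real data is involved.
rng = np.random.default_rng(0)
images = rng.random((200, 16, 16, 3))

signals = {
    "luminance": np.array([mean_luminance(img) for img in images]),
    "sharpness": np.array([laplacian_variance(img.mean(axis=2)) for img in images]),
    # Hypothetical placeholder for the score of a pre-trained detector.
    "detector_score": rng.random(200),
}

# Keep only samples whose detector score is at least 0.5, then aggregate.
mask = filter_by_thresholds(signals, {"detector_score": (0.5, 1.0)})
names, histograms, correlations = aggregate_report(signals, mask)

print("samples kept:", int(mask.sum()))
for name in names:
    print(name, "histogram:", histograms[name].tolist())
print("correlation matrix shape:", correlations.shape)
```

The report exposes only counts per bin and pairwise correlations, so no per-image value leaves the function; this mirrors the abstract's point that only aggregated statistics are provided.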
Related papers
- SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition [0.0]
We introduce the Synthetic Identity Generation pipeline, or SIG, that allows for the targeted creation of ethical, balanced datasets for face recognition evaluation.
Our pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age.
We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age.
arXiv Detail & Related papers (2024-09-12T18:18:02Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Detecting sexually explicit content in the context of the child sexual abuse materials (CSAM): end-to-end classifiers and region-based networks [0.0]
Child sexual abuse materials (CSAM) pose a significant threat to the safety and well-being of children worldwide.
This study presents methods for classifying sexually explicit content, which plays a crucial role in the automated CSAM detection system.
arXiv Detail & Related papers (2024-06-20T09:21:08Z)
- Leveraging Synthetic Data for Generalizable and Fair Facial Action Unit Detection [9.404202619102943]
We propose to use synthetically generated data and multi-source domain adaptation (MSDA) to address the problems of the scarcity of labeled data and the diversity of subjects.
Specifically, we propose to generate a diverse dataset through synthetic facial expression re-targeting.
To further improve gender fairness, PM2 matches the features of the real data with a female and a male synthetic image.
arXiv Detail & Related papers (2024-03-15T23:50:18Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- AI-based Re-identification of Behavioral Clickstream Data [0.0]
This paper demonstrates that similar techniques can be applied to successfully re-identify individuals purely based on their behavioral patterns.
The mere resemblance of behavioral patterns between records is sufficient to correctly attribute behavioral data to identified individuals.
We also demonstrate how synthetic data can offer a viable alternative that is shown to be resilient against our introduced AI-based re-identification attacks.
arXiv Detail & Related papers (2022-01-21T16:49:00Z)
- Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
- Metadata-Based Detection of Child Sexual Abuse Material [1.1470070927586016]
Child Sexual Abuse Media (CSAM) is any visual record of a sexually-explicit activity involving minors.
We propose a framework for training and evaluating deployment-ready machine learning models for CSAM identification.
arXiv Detail & Related papers (2020-10-05T23:10:21Z)