Short Text Classification Approach to Identify Child Sexual Exploitation Material
- URL: http://arxiv.org/abs/2011.01113v2
- Date: Fri, 13 Nov 2020 09:39:29 GMT
- Title: Short Text Classification Approach to Identify Child Sexual Exploitation Material
- Authors: Mhd Wesam Al-Nabki, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz-Rodríguez
- Abstract summary: This paper presents two approaches based on short text classification to identify Child Sexual Exploitation Material (CSEM) files.
The presented solution could be integrated into forensic tools and services to support Law Enforcement Agencies to identify CSEM without tackling every file's visual content.
- Score: 4.415977307120616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Producing or sharing Child Sexual Exploitation Material (CSEM) is a serious
crime fought vigorously by Law Enforcement Agencies (LEAs). When an LEA seizes
a computer from a potential producer or consumer of CSEM, they need to analyze
the files on the suspect's hard disk, looking for evidence. However,
manually inspecting file content for CSEM is time-consuming and, in most
cases, infeasible in the amount of time available to the
Spanish police under a search warrant. Instead of analyzing file content,
another approach that can be used to speed up the process is to identify CSEM
by analyzing the file names and their absolute paths. The main challenge of
this task lies in dealing with short text deliberately distorted by the
owners of this material using obfuscated words and user-defined naming
patterns. This paper presents and compares two approaches based on short text
classification to identify CSEM files. The first one employs two independent
supervised classifiers, one for the file name and the other for the path, and
their outputs are later fused into a single score. In contrast, the second
approach uses only the file name classifier to iterate over the file's absolute
path. Both approaches operate at the character n-grams level, while binary and
orthographic features enrich the file name representation, and a binary
Logistic Regression model is used for classification. The presented file
classifier achieved an average class recall of 0.98. This solution could be
integrated into forensic tools and services to support Law Enforcement Agencies
to identify CSEM without processing every file's visual content, which is
computationally far more demanding.
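The pipeline described in the abstract (character n-grams fed to a binary logistic regression, applied to file names and path components) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `NgramLogReg`, the helpers `score_path` and `fuse`, the n-gram range, the max-over-components rule, the weighted-average fusion, and the neutral toy data (report-like vs. photo-like names) are all hypothetical, and the binary and orthographic features mentioned in the abstract are omitted.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=3):
    """Count character n-grams of a lowercased, boundary-padded string."""
    padded = f"^{text.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams

class NgramLogReg:
    """Binary logistic regression over sparse n-gram counts, trained with plain SGD."""

    def __init__(self, lr=0.05, epochs=100):
        self.lr, self.epochs = lr, epochs
        self.weights, self.bias = {}, 0.0

    def predict_proba(self, text):
        feats = char_ngrams(text)
        z = self.bias + sum(self.weights.get(g, 0.0) * c for g, c in feats.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, texts, labels):
        for _ in range(self.epochs):
            for text, y in zip(texts, labels):
                err = self.predict_proba(text) - y  # gradient of the log loss w.r.t. z
                self.bias -= self.lr * err
                for g, c in char_ngrams(text).items():
                    self.weights[g] = self.weights.get(g, 0.0) - self.lr * err * c
        return self

def score_path(model, path):
    """Apply the file name classifier to every path component.
    Taking the maximum is an assumed aggregation rule."""
    parts = [p for p in path.replace("\\", "/").split("/") if p]
    return max((model.predict_proba(p) for p in parts), default=0.0)

def fuse(name_score, path_score, w=0.5):
    """Fuse two classifier scores; a weighted average is an assumed fusion rule."""
    return w * name_score + (1.0 - w) * path_score

# Neutral toy data: label 1 = report-like names, label 0 = photo-like names.
train_names = [
    "budget_report_q1.xlsx", "annual_report_2019.pdf", "sales_report_final.docx",
    "holiday_photo_01.jpg", "beach_photo_02.png", "family_photo_album.jpg",
]
train_labels = [1, 1, 1, 0, 0, 0]
model = NgramLogReg().fit(train_names, train_labels)

name_score = model.predict_proba("report_q3.pdf")
path_score = score_path(model, "/home/user/docs/annual_report_2019.pdf")
fused_score = fuse(name_score, path_score)
```

Character n-grams are a natural fit for short, deliberately distorted strings because a misspelled or obfuscated token still shares most of its n-grams with the original, whereas a word-level vocabulary would miss it entirely.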
Related papers
- Document Type Classification using File Names [7.130525292849283]
Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification.
Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets.
We present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method.
arXiv Detail & Related papers (2024-10-02T01:42:19Z)
- African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification [53.89380284760555]
FOCI (Fine-grained Object ClassIfication) is a difficult multiple-choice benchmark for fine-grained object classification.
FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k.
arXiv Detail & Related papers (2024-06-20T16:59:39Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval.
This work aims to compensate for any performance loss from this rapid archive search by examining reranking approaches.
Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
arXiv Detail & Related papers (2023-09-14T11:13:36Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain [0.0]
A new focused crawler is proposed called ThreatCrawl.
It uses BiBERT-based models to classify documents and adapt its crawling path dynamically.
It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
arXiv Detail & Related papers (2023-04-24T09:53:33Z)
- Adversarial Networks and Machine Learning for File Classification [0.0]
Correctly identifying the type of file under examination is a critical part of a forensic investigation.
We propose using an adversarially-trained machine learning neural network to determine a file's true type.
Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types.
arXiv Detail & Related papers (2023-01-27T19:40:03Z)
- Same or Different? Diff-Vectors for Authorship Analysis [78.83284164605473]
In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document.
Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd.
arXiv Detail & Related papers (2023-01-24T08:48:12Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Content-Based Textual File Type Detection at Scale [0.0]
Programming language detection is a common need in the analysis of large source code bases.
We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content.
arXiv Detail & Related papers (2021-01-21T09:08:42Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.