Related papers: Illicit Darkweb Classification via Natural-language Processing: Classifying Illicit Content of Webpages based on Textual Information

URL: http://arxiv.org/abs/2312.04944v1
Date: Fri, 8 Dec 2023 10:19:48 GMT
Title: Illicit Darkweb Classification via Natural-language Processing: Classifying Illicit Content of Webpages based on Textual Information
Authors: Giuseppe Cascavilla, Gemma Catolino, Mirella Sangiovanni
Abstract summary: This work aims to expand previous work on classifying illegal activities. We created a heterogeneous dataset of 113995 onion sites and dark marketplaces. We developed two classification approaches for illegal activities: one for illicit content on the Dark Web and one for identifying specific types of drugs.
Score: 4.005483185111992
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This work aims to expand previous work on classifying illegal activities, in three steps. First, we created a heterogeneous dataset of 113995 onion sites and dark marketplaces. Then, we compared pre-trained transferable models, i.e., ULMFiT (Universal Language Model Fine-tuning), BERT (Bidirectional Encoder Representations from Transformers), and RoBERTa (Robustly optimized BERT approach), with a traditional text classification approach, LSTM (Long Short-Term Memory) neural networks. Finally, we developed two classification approaches for illegal activities, one for illicit content on the Dark Web and one for identifying the specific types of drugs. Results show that BERT performed best, classifying the dark web's general content and the types of drugs with 96.08% and 91.98% accuracy, respectively.

Related papers

A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms [9.521604326086608]
This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy. We extract semantic representations using ModernBERT, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes. We incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata.
arXiv Detail & Related papers (2025-07-19T05:54:52Z)
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively. The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors. We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z)
Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation [50.70671908078593]
This paper proposes a novel training paradigm, learning towards Synchronous Memorizability and Generalizability (SMG-Learning). We create the orientational gradient alignment to ensure memorizability on previous sites, and arbitrary gradient alignment to enhance generalizability on unseen sites. Experimental results show that our method efficiently enhances both memorizability and generalizability better than other state-of-the-art methods.
arXiv Detail & Related papers (2024-06-26T03:10:57Z)
Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination. Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z)
Generative Multi-modal Models are Good Class-Incremental Learners [51.5648732517187]
We propose a novel generative multi-modal model (GMM) framework for class-incremental learning. Our approach directly generates labels for images using an adapted generative model. Under the Few-shot CIL setting, we have improved by at least 14% accuracy over all the current state-of-the-art methods with significantly less forgetting.
arXiv Detail & Related papers (2024-03-27T09:21:07Z)
Bengali Intent Classification with Generative Adversarial BERT [0.24578723416255746]
We introduce BNIntent30, a comprehensive Bengali intent classification dataset containing 30 intent classes. The dataset is excerpted and translated from the CLINIC150 dataset containing a diverse range of user intents categorized over 150 classes. We propose a novel approach for Bengali intent classification using Generative Adversarial BERT to evaluate the proposed dataset, which we call GAN-BnBERT.
arXiv Detail & Related papers (2023-12-17T10:45:50Z)
When the Few Outweigh the Many: Illicit Content Recognition with Few-Shot Learning [0.0]
This paper investigates an alternative technique for recognizing illegal activities from images. Siamese neural networks reach 90.9% on 20-Shot experiments over a 10-class dataset.
arXiv Detail & Related papers (2023-11-28T18:28:03Z)
Retrieval and Generative Approaches for a Pregnancy Chatbot in Nepali with Stemmed and Non-Stemmed Data : A Comparative Study [0.0]
The performance of datasets in Nepali language has been analyzed for each approach. BERT-based pre-trained models perform well on non-stemmed data whereas scratch transformer models have better performance on stemmed data.
arXiv Detail & Related papers (2023-11-12T17:16:46Z)
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
hBert + BiasCorp -- Fighting Racism on the Web [58.768804813646334]
We are releasing BiasCorp, a dataset containing 139,090 comments and news segments from three specific sources - Fox News, BreitbartNews and YouTube. In this work, we present hBERT, where we modify certain layers of the pretrained BERT model with the new Hopfield Layer. We are also releasing a JavaScript library and a Chrome Extension Application, to help developers make use of our trained model in web applications.
arXiv Detail & Related papers (2021-04-06T02:17:20Z)
A Federated Approach for Fine-Grained Classification of Fashion Apparel [4.328969982631974]
This paper aims to enable an in-depth classification of fashion item attributes within the same category. The proposed scheme is comprised of three major stages: (a) localization of a target item from an input image using semantic segmentation, (b) detection of human key points (e.g., point of shoulder) using a pre-trained CNN and a bounding box, and (c) three phases to classify the attributes using a combination of algorithmic approaches and deep neural networks.
arXiv Detail & Related papers (2020-08-27T19:44:43Z)
Deep Contextual Embeddings for Address Classification in E-commerce [0.03222802562733786]
E-commerce customers in developing nations like India tend to follow no fixed format while entering shipping addresses. It is imperative to understand the language of addresses, so that shipments can be routed without delays. We propose a novel approach towards understanding customer addresses by deriving motivation from recent advances in Natural Language Processing (NLP).
arXiv Detail & Related papers (2020-07-06T19:06:34Z)
Adversarial Feature Hallucination Networks for Few-Shot Learning [84.31660118264514]
Adversarial Feature Hallucination Networks (AFHN) is based on conditional Wasserstein Generative Adversarial networks (cWGAN). Two novel regularizers are incorporated into AFHN to encourage discriminability and diversity of the synthesized features.
arXiv Detail & Related papers (2020-03-30T02:43:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.