Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning
- URL: http://arxiv.org/abs/2511.23183v1
- Date: Fri, 28 Nov 2025 13:51:18 GMT
- Title: Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning
- Authors: Sebastião Alves de Jesus Filho, Gustavo Di Giovanni Bernardo, Paulo Henrique Ribeiro Gabriel, Bruno Bogaz Zarpelão, Rodrigo Sanches Miani,
- Abstract summary: This study applies text mining techniques and machine learning to data collected from Dark Web forums in Brazilian Portuguese to identify malicious posts.<n>To our knowledge, this is the first study to focus specifically on Brazilian Portuguese content in this domain.<n>The best-performing model, using LightGBM and TF-IDF, was able to detect relevant posts with high accuracy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the constant growth and increasing sophistication of cyberattacks, cybersecurity can no longer rely solely on traditional defense techniques and tools. Proactive detection of cyber threats has become essential to help security teams identify potential risks and implement effective mitigation measures. Cyber Threat Intelligence (CTI) plays a key role by providing security analysts with evidence-based knowledge about cyber threats. CTI information can be extracted using various techniques and data sources; however, machine learning has proven promising. As for data sources, social networks and online discussion forums are commonly explored. In this study, we apply text mining techniques and machine learning to data collected from Dark Web forums in Brazilian Portuguese to identify malicious posts. Our contributions include the creation of three original datasets, a novel multi-stage labeling process combining indicators of compromise (IoCs), contextual keywords, and manual analysis, and a comprehensive evaluation of text representations and classifiers. To our knowledge, this is the first study to focus specifically on Brazilian Portuguese content in this domain. The best-performing model, using LightGBM and TF-IDF, was able to detect relevant posts with high accuracy. We also applied topic modeling to validate the model's outputs on unlabeled data, confirming its robustness in real-world scenarios.
Related papers
- False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems [1.4932549821542682]
Cyber Threat Intelligence (CTI) has emerged as a vital complementary approach that operates in the early phases of the cyber threat lifecycle.<n>Due to the large volume of data, automation through Machine Learning (ML) and Natural Language Processing (NLP) models is essential for effective CTI extraction.<n>This study investigates vulnerabilities within various components of the entire CTI pipeline and their susceptibility to adversarial attacks.
arXiv Detail & Related papers (2025-07-05T19:00:27Z) - Technique Inference Engine: A Recommender Model to Support Cyber Threat Hunting [0.6990493129893112]
Cyber threat hunting is the practice of proactively searching for latent threats in a network.<n>To aid analysts in identifying techniques which may be co-occurring as part of a campaign, we present the Technique Inference Engine.
arXiv Detail & Related papers (2025-03-04T22:31:43Z) - CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models [49.657358248788945]
Textual descriptions in cyber threat intelligence (CTI) reports are rich sources of knowledge about cyber threats.<n>Current CTI knowledge extraction methods lack flexibility and generalizability.<n>We propose CTINexus, a novel framework for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction.
arXiv Detail & Related papers (2024-10-28T14:18:32Z) - Countering Autonomous Cyber Threats [40.00865970939829]
Foundation Models present dual-use concerns broadly and within the cyber domain specifically.
Recent research has shown the potential for these advanced models to inform or independently execute offensive cyberspace operations.
This work evaluates several state-of-the-art FMs on their ability to compromise machines in an isolated network and investigates defensive mechanisms to defeat such AI-powered attacks.
arXiv Detail & Related papers (2024-10-23T22:46:44Z) - KGV: Integrating Large Language Models with Knowledge Graphs for Cyber Threat Intelligence Credibility Assessment [38.312774244521]
Cyber threat intelligence (CTI) is a crucial tool to prevent sophisticated, organized, and weaponized cyber attacks.<n>We propose Knowledge Graph-based Verifier (KGV), the first framework integrating large language models (LLMs) with simple structured knowledge graphs (KGs) for automated CTI credibility assessment.<n> Experimental results demonstrate that our KGV outperforms state-of-the-art fact reasoning methods on the CTI-200 dataset, achieving a 5.7% improvement in F1.
arXiv Detail & Related papers (2024-08-15T11:32:46Z) - A Survey on Machine Unlearning: Techniques and New Emerged Privacy Risks [42.3024294376025]
Machine unlearning is a research hotspot in the field of privacy protection.
Recent researchers have found potential privacy leakages of various of machine unlearning approaches.
We analyze privacy risks in various aspects, including definitions, implementation methods, and real-world applications.
arXiv Detail & Related papers (2024-06-10T11:31:04Z) - Graph Mining for Cybersecurity: A Survey [61.505995908021525]
The explosive growth of cyber attacks nowadays, such as malware, spam, and intrusions, caused severe consequences on society.
Traditional Machine Learning (ML) based methods are extensively used in detecting cyber threats, but they hardly model the correlations between real-world cyber entities.
With the proliferation of graph mining techniques, many researchers investigated these techniques for capturing correlations between cyber entities and achieving high performance.
arXiv Detail & Related papers (2023-04-02T08:43:03Z) - Exploring the Limits of Transfer Learning with Unified Model in the
Cybersecurity Domain [17.225973170682604]
We introduce a generative multi-task model, Unified Text-to-Text Cybersecurity (UTS)
UTS is trained on malware reports, phishing site URLs, programming code constructs, social media data, blogs, news articles, and public forum posts.
We show UTS improves the performance of some cybersecurity datasets.
arXiv Detail & Related papers (2023-02-20T22:21:26Z) - Recognizing and Extracting Cybersecurtity-relevant Entities from Text [1.7499351967216343]
Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks.
CTI is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG)
arXiv Detail & Related papers (2022-08-02T18:44:06Z) - Towards Automated Classification of Attackers' TTPs by combining NLP
with ML Techniques [77.34726150561087]
We evaluate and compare different Natural Language Processing (NLP) and machine learning techniques used for security information extraction in research.
Based on our investigations we propose a data processing pipeline that automatically classifies unstructured text according to attackers' tactics and techniques.
arXiv Detail & Related papers (2022-07-18T09:59:21Z) - Generating Cyber Threat Intelligence to Discover Potential Security
Threats Using Classification and Topic Modeling [6.0897744845912865]
Cyber Threat Intelligence (CTI) has been represented as one of the proactive and robust mechanisms.
Our goal is to identify and explore relevant CTI from hacker forums by using different supervised and unsupervised learning techniques.
arXiv Detail & Related papers (2021-08-16T02:30:29Z) - Dos and Don'ts of Machine Learning in Computer Security [74.1816306998445]
Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance.
We identify common pitfalls in the design, implementation, and evaluation of learning-based security systems.
We propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible.
arXiv Detail & Related papers (2020-10-19T13:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.