Bandit on the Hunt: Dynamic Crawling for Cyber Threat Intelligence
- URL: http://arxiv.org/abs/2504.18375v1
- Date: Fri, 25 Apr 2025 14:19:56 GMT
- Title: Bandit on the Hunt: Dynamic Crawling for Cyber Threat Intelligence
- Authors: Philipp Kuehn, Dilara Nadermahmoodi, Markus Bayer, Christian Reuter
- Abstract summary: Public information contains valuable Cyber Threat Intelligence (CTI) that is used to prevent future attacks. Current research focuses on extracting Indicators of Compromise from known sources. This paper proposes a CTI-focused crawler using multi-armed bandit (MAB) and various crawling strategies.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Public information contains valuable Cyber Threat Intelligence (CTI) that is used to prevent future attacks. While standards exist for sharing this information, much appears in non-standardized news articles or blogs. Monitoring online sources for threats is time-consuming and source selection is uncertain. Current research focuses on extracting Indicators of Compromise from known sources, rarely addressing new source identification. This paper proposes a CTI-focused crawler using multi-armed bandit (MAB) and various crawling strategies. It employs SBERT to identify relevant documents while dynamically adapting its crawling path. Our system ThreatCrawl achieves a harvest rate exceeding 25% and expands its seed by over 300% while maintaining topical focus. Additionally, the crawler identifies previously unknown but highly relevant overview pages, datasets, and domains.
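The abstract names two ideas that a small example can make concrete: a multi-armed bandit choosing among crawling strategies, and the harvest rate (relevant pages divided by fetched pages). The sketch below is illustrative only and is not the ThreatCrawl implementation. It assumes an epsilon-greedy bandit, which the abstract does not specify, and it replaces the SBERT relevance check with a simulated per-strategy probability that a fetched page is relevant. The class, function, and strategy names are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit; each arm stands for a crawling strategy."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.mean_reward = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before relying on the reward estimates.
        untried = [arm for arm in self.arms if self.pulls[arm] == 0]
        if untried:
            return untried[0]
        # Explore a random arm with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        # Otherwise exploit the arm with the best observed mean reward.
        return max(self.arms, key=lambda arm: self.mean_reward[arm])

    def update(self, arm, reward):
        # Incremental update of the running mean reward for this arm.
        self.pulls[arm] += 1
        n = self.pulls[arm]
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / n


def simulate_crawl(relevance_prob, steps=2000, epsilon=0.1, seed=0):
    """Simulate a crawl; relevance_prob maps strategy name -> P(page is relevant).

    The probabilities are a stand-in (assumption) for a real relevance
    classifier such as an SBERT similarity check against seed documents.
    """
    bandit = EpsilonGreedyBandit(relevance_prob, epsilon=epsilon, seed=seed)
    rng = random.Random(seed + 1)
    relevant = 0
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < relevance_prob[arm] else 0.0
        bandit.update(arm, reward)
        relevant += int(reward)
    # Harvest rate: fraction of fetched pages judged relevant.
    harvest_rate = relevant / steps
    return harvest_rate, bandit.pulls
```

Calling `simulate_crawl({"breadth_first": 0.05, "follow_relevant_links": 0.6})` returns the simulated harvest rate together with the number of times each hypothetical strategy was chosen. In a real crawler the reward would come from classifying each fetched document rather than from a fixed probability.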
Related papers
- Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks. We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance. Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z)
- Model Inversion Attacks: A Survey of Approaches and Countermeasures [59.986922963781]
Recently, a new type of privacy attack, the model inversion attack (MIA), has emerged that aims to extract sensitive features of private training data.
Despite their significance, there is a lack of systematic studies that provide a comprehensive overview and deeper insights into MIAs.
This survey aims to summarize up-to-date MIA methods in both attacks and defenses.
arXiv Detail & Related papers (2024-11-15T08:09:28Z)
- AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports [3.6785107661544805]
We present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports.
The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts.
In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.
arXiv Detail & Related papers (2024-04-11T14:04:36Z)
- Robust Recommender System: A Survey and Future Directions [58.87305602959857]
We first present a taxonomy to organize current techniques for withstanding malicious attacks and natural noise.
We then explore state-of-the-art methods in each category, including fraudster detection, adversarial training, and certifiable robust training for defending against malicious attacks.
We discuss robustness across varying recommendation scenarios and its interplay with other properties like accuracy, interpretability, privacy, and fairness.
arXiv Detail & Related papers (2023-09-05T08:58:46Z)
- On the Security Risks of Knowledge Graph Reasoning [71.64027889145261]
We systematize the security threats to KGR according to the adversary's objectives, knowledge, and attack vectors.
We present ROAR, a new class of attacks that instantiate a variety of such threats.
We explore potential countermeasures against ROAR, including filtering of potentially poisoning knowledge and training with adversarially augmented queries.
arXiv Detail & Related papers (2023-05-03T18:47:42Z)
- ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain [0.0]
This paper proposes a new focused crawler called ThreatCrawl. It uses BiBERT-based models to classify documents and adapt its crawling path dynamically. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
arXiv Detail & Related papers (2023-04-24T09:53:33Z)
- Cybersecurity Threat Hunting and Vulnerability Analysis Using a Neo4j Graph Database of Open Source Intelligence [0.8192907805418583]
We present a system which constructs a Neo4j graph database formed by shared connections between open source intelligence text and other information.
These connections are comprised of possible indicators of compromise (e.g., IP addresses, domains, hashes, email addresses, phone numbers) and information on known exploits and techniques.
We show three specific examples of interesting connections found in the graph database: the connections to a known exploited CVE, a known malicious IP address, and a malware hash signature.
arXiv Detail & Related papers (2023-01-27T22:29:22Z)
- ThreatKG: An AI-Powered System for Automated Open-Source Cyber Threat Intelligence Gathering and Management [65.0114141380651]
ThreatKG is an automated system for OSCTI gathering and management.
It efficiently collects a large number of OSCTI reports from multiple sources.
It uses specialized AI-based techniques to extract high-quality knowledge about various threat entities.
arXiv Detail & Related papers (2022-12-20T16:13:59Z)
- Reducing Information Overload: Because Even Security Experts Need to Blink [0.0]
Computer Emergency Response Teams (CERTs) face increasing challenges processing the growing volume of security-related information. This work evaluates 196 combinations of clustering algorithms and embedding models across five security-related datasets to identify optimal approaches for automated information consolidation. We demonstrate that clustering can reduce information processing requirements by over 90% while maintaining semantic coherence.
arXiv Detail & Related papers (2022-10-25T14:50:11Z)
- What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey [1.1064955465386]
We systematically collect "CTI extraction from text"-related studies from the literature.
We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline.
arXiv Detail & Related papers (2021-09-14T16:38:41Z)
- Generating Cyber Threat Intelligence to Discover Potential Security Threats Using Classification and Topic Modeling [6.0897744845912865]
Cyber Threat Intelligence (CTI) has been represented as one of the proactive and robust mechanisms.
Our goal is to identify and explore relevant CTI from hacker forums by using different supervised and unsupervised learning techniques.
arXiv Detail & Related papers (2021-08-16T02:30:29Z)
- A System for Automated Open-Source Threat Intelligence Gathering and Management [53.65687495231605]
SecurityKG is a system for automated OSCTI gathering and management.
It uses a combination of AI and NLP techniques to extract high-fidelity knowledge about threat behaviors.
arXiv Detail & Related papers (2021-01-19T18:31:35Z)