ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain
- URL: http://arxiv.org/abs/2304.11960v2
- Date: Wed, 26 Apr 2023 13:25:45 GMT
- Title: ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain
- Authors: Philipp Kuehn, Mike Schmidt, Markus Bayer, Christian Reuter
- Abstract summary: A new focused crawler is proposed called ThreatCrawl.
It uses BiBERT-based models to classify documents and adapt its crawling path dynamically.
It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Publicly available information contains valuable information for Cyber Threat
Intelligence (CTI). This can be used to prevent attacks that have already taken
place on other systems. Ideally, only the initial attack succeeds and all
subsequent ones are detected and stopped. But while there are different
standards to exchange this information, a lot of it is shared in articles or
blog posts in non-standardized ways. Manually scanning through multiple online
portals and news pages to discover new threats and extracting them is a
time-consuming task. To automize parts of this scanning process, multiple
papers propose extractors that use Natural Language Processing (NLP) to extract
Indicators of Compromise (IOCs) from documents. However, while this already
solves the problem of extracting the information out of documents, the search
for these documents is rarely considered. In this paper, a new focused crawler
is proposed called ThreatCrawl, which uses Bidirectional Encoder
Representations from Transformers (BERT)-based models to classify documents and
adapt its crawling path dynamically. While ThreatCrawl has difficulties to
classify the specific type of Open Source Intelligence (OSINT) named in texts,
e.g., IOC content, it can successfully find relevant documents and modify its
path accordingly. It yields harvest rates of up to 52%, which are, to the best
of our knowledge, better than the current state of the art.
Related papers
- Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks.
We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance.
Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z) - Model Inversion Attacks: A Survey of Approaches and Countermeasures [59.986922963781]
Recently, a new type of privacy attack, the model inversion attacks (MIAs), aims to extract sensitive features of private data for training.
Despite the significance, there is a lack of systematic studies that provide a comprehensive overview and deeper insights into MIAs.
This survey aims to summarize up-to-date MIA methods in both attacks and defenses.
arXiv Detail & Related papers (2024-11-15T08:09:28Z) - AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports [3.6785107661544805]
We present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports.
The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts.
In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.
arXiv Detail & Related papers (2024-04-11T14:04:36Z) - On the Security Risks of Knowledge Graph Reasoning [71.64027889145261]
We systematize the security threats to KGR according to the adversary's objectives, knowledge, and attack vectors.
We present ROAR, a new class of attacks that instantiate a variety of such threats.
We explore potential countermeasures against ROAR, including filtering of potentially poisoning knowledge and training with adversarially augmented queries.
arXiv Detail & Related papers (2023-05-03T18:47:42Z) - Cybersecurity Threat Hunting and Vulnerability Analysis Using a Neo4j Graph Database of Open Source Intelligence [0.8192907805418583]
We present a system which constructs a Neo4j graph database formed by shared connections between open source intelligence text and other information.
These connections are comprised of possible indicators of compromise (e.g., IP addresses, domains, hashes, email addresses, phone numbers) and information on known exploits and techniques.
We show three specific examples of interesting connections found in the graph database; the connections to a known exploited CVE, a known malicious IP address, and a malware hash signature.
arXiv Detail & Related papers (2023-01-27T22:29:22Z) - Invisible Backdoor Attack with Dynamic Triggers against Person
Re-identification [71.80885227961015]
Person Re-identification (ReID) has rapidly progressed with wide real-world applications, but also poses significant risks of adversarial attacks.
We propose a novel backdoor attack on ReID under a new all-to-unknown scenario, called Dynamic Triggers Invisible Backdoor Attack (DT-IBA)
We extensively validate the effectiveness and stealthiness of the proposed attack on benchmark datasets, and evaluate the effectiveness of several defense methods against our attack.
arXiv Detail & Related papers (2022-11-20T10:08:28Z) - What are the attackers doing now? Automating cyber threat intelligence
extraction from text on pace with the changing threat landscape: A survey [1.1064955465386]
We systematically collect "CTI extraction from text"-related studies from the literature.
We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline.
arXiv Detail & Related papers (2021-09-14T16:38:41Z) - Generating Cyber Threat Intelligence to Discover Potential Security
Threats Using Classification and Topic Modeling [6.0897744845912865]
Cyber Threat Intelligence (CTI) has been represented as one of the proactive and robust mechanisms.
Our goal is to identify and explore relevant CTI from hacker forums by using different supervised and unsupervised learning techniques.
arXiv Detail & Related papers (2021-08-16T02:30:29Z) - EXTRACTOR: Extracting Attack Behavior from Threat Reports [6.471387545969443]
We propose a novel approach and tool called provenanceOR that allows precise automatic extraction of concise attack behaviors from CTI reports.
provenanceOR makes no strong assumptions about the text and is capable of extracting attack behaviors as graphs from unstructured text.
Our evaluation results show that provenanceOR can extract concise graphs from CTI reports and can successfully be used by cyber-analytics tools in threat-hunting.
arXiv Detail & Related papers (2021-04-17T18:51:00Z) - A System for Automated Open-Source Threat Intelligence Gathering and
Management [53.65687495231605]
SecurityKG is a system for automated OSCTI gathering and management.
It uses a combination of AI and NLP techniques to extract high-fidelity knowledge about threat behaviors.
arXiv Detail & Related papers (2021-01-19T18:31:35Z) - Backdoor Learning: A Survey [75.59571756777342]
Backdoor attack intends to embed hidden backdoor into deep neural networks (DNNs)
Backdoor learning is an emerging and rapidly growing research area.
This paper presents the first comprehensive survey of this realm.
arXiv Detail & Related papers (2020-07-17T04:09:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.