Related papers: ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain

URL: http://arxiv.org/abs/2304.11960v2
Date: Wed, 26 Apr 2023 13:25:45 GMT
Title: ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain
Authors: Philipp Kuehn, Mike Schmidt, Markus Bayer, Christian Reuter
Abstract summary: A new focused crawler called ThreatCrawl is proposed. It uses BERT-based models to classify documents and adapt its crawling path dynamically. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Publicly available information contains valuable information for Cyber Threat Intelligence (CTI). This can be used to prevent attacks that have already taken place on other systems. Ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards to exchange this information, a lot of it is shared in articles or blog posts in non-standardized ways. Manually scanning through multiple online portals and news pages to discover new threats and extracting them is a time-consuming task. To automate parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this already solves the problem of extracting the information out of documents, the search for these documents is rarely considered. In this paper, a new focused crawler called ThreatCrawl is proposed, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulty classifying the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
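
To make the terms "focused crawler" and "harvest rate" in the abstract concrete, here is a minimal sketch. It is not ThreatCrawl's implementation: the in-memory link graph TOY_WEB, the keyword scorer relevance (a stand-in for the BERT-based classifier), the threshold, and the function name focused_crawl are all assumptions made for illustration. Harvest rate is taken in its usual sense, the share of crawled pages that are judged relevant.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("security news portal", ["a", "b", "c"]),
    "a": ("new malware campaign with indicators of compromise", ["d", "e"]),
    "b": ("cooking recipes for the weekend", ["f"]),
    "c": ("ransomware exploit analysis", ["g"]),
    "d": ("malware hashes and exploit details", []),
    "e": ("vulnerability advisory for a web server", []),
    "f": ("more recipes", []),
    "g": ("football results", []),
}

# Assumed keyword list; a real system would use a trained classifier instead.
KEYWORDS = {"malware", "ransomware", "exploit", "vulnerability", "indicators"}


def relevance(text):
    """Keyword stand-in for a document classifier; returns a score in [0, 1]."""
    hits = sum(1 for word in text.lower().split() if word in KEYWORDS)
    return min(1.0, hits / 2)


def focused_crawl(web, seed, budget, threshold=0.5):
    """Crawl up to `budget` pages, following links from relevant pages first."""
    frontier = [(-1.0, seed)]  # min-heap on negated priority = max-heap
    seen = {seed}
    crawled, relevant = [], []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text)
        crawled.append(url)
        if score >= threshold:
            relevant.append(url)
        # Adapt the path: links inherit the score of the page they were found on.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    harvest_rate = len(relevant) / len(crawled) if crawled else 0.0
    return crawled, relevant, harvest_rate


if __name__ == "__main__":
    crawled, relevant, rate = focused_crawl(TOY_WEB, "seed", budget=6)
    print("crawl order:", crawled)
    print("relevant pages:", relevant)
    print(f"harvest rate: {rate:.2f}")
```

In this toy graph the crawler visits pages d and e (linked from the relevant page a) before the off-topic page b, which is the path adaptation the abstract describes in miniature. Letting links inherit the parent page's score is only one simple prioritisation choice among many.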

Related papers

Bandit on the Hunt: Dynamic Crawling for Cyber Threat Intelligence [0.0]
Public information contains valuable Cyber Threat Intelligence (CTI) that is used to prevent future attacks. Current research focuses on extracting Indicators of Compromise from known sources. This paper proposes a CTI-focused crawler using multi-armed bandit (MAB) and various crawling strategies.
arXiv Detail & Related papers (2025-04-25T14:19:56Z)
Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format. We first craft the Wiki-SS dataset, a corpus of 1.3M Wikipedia web page screenshots, to answer the questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
CRATOR: a Dark Web Crawler [1.7224362150588657]
This study proposes a general dark web crawler designed to extract pages handling security protocols, such as captchas. Our approach uses a combination of seed URL lists, link analysis, and scanning to discover new content.
arXiv Detail & Related papers (2024-05-10T09:39:12Z)
AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports [3.6785107661544805]
We present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.
arXiv Detail & Related papers (2024-04-11T14:04:36Z)
IsoEx: an explainable unsupervised approach to process event logs cyber investigation [0.0]
This paper introduces a novel method, IsoEx, for detecting anomalous and potentially problematic command lines. To detect anomalies, IsoEx resorts to an unsupervised anomaly detection technique that is both highly sensitive and lightweight.
arXiv Detail & Related papers (2023-06-07T14:22:41Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system. We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
arXiv Detail & Related papers (2022-12-20T18:09:52Z)
ThreatCluster: Threat Clustering for Information Overload Reduction in Computer Emergency Response Teams [0.0]
Threats and diversity of information sources pose challenges for CERTs. To respond to emerging threats, CERTs must gather information in a timely and comprehensive manner. This paper contributes to the question of how to reduce information overload for CERTs.
arXiv Detail & Related papers (2022-10-25T14:50:11Z)
CyNER: A Python Library for Cybersecurity Named Entity Recognition [3.871148938060281]
CyNER is an open-source Python library for cybersecurity entity recognition. We provide models trained on a diverse corpus that users can readily use. The library is made publicly available.
arXiv Detail & Related papers (2022-04-08T16:49:32Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey [1.1064955465386]
We systematically collect "CTI extraction from text"-related studies from the literature. We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline.
arXiv Detail & Related papers (2021-09-14T16:38:41Z)
Short Text Classification Approach to Identify Child Sexual Exploitation Material [4.415977307120616]
This paper presents two approaches based on short text classification to identify Child Sexual Exploitation Material (CSEM) files. The presented solution could be integrated into forensic tools and services to support Law Enforcement Agencies to identify CSEM without tackling every file's visual content.
arXiv Detail & Related papers (2020-10-29T09:37:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.