Related papers: A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence

URL: http://arxiv.org/abs/2109.06932v1
Date: Tue, 14 Sep 2021 19:26:08 GMT
Title: A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
Authors: Paris Koloveas, Thanasis Chantzios, Christos Tryfonopoulos, Spiros Skiadopoulos
Abstract summary: The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information. We present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.
Score: 1.1661238776379117
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. In the first phase, a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase, state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
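
The two-phase approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a tiny naive Bayes classifier stands in for the machine learning-based crawler, and signed feature hashing stands in for the statistical language model that produces the low-dimensional representation (the abstract does not name either technique). All names, the training snippets, and the in-memory "web" are hypothetical.

```python
import heapq
import math
import re
import zlib
from collections import Counter

DIM = 16  # size of the low-dimensional space (arbitrary, hypothetical)

# Hypothetical labelled snippets: True = security-related, False = off-topic.
TRAIN = [
    ("iot botnet exploit released", True),
    ("router firmware vulnerability exploit", True),
    ("football match results", False),
    ("celebrity cooking recipes", False),
]

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "seed": ("security news portal", ["a", "b"]),
    "a": ("iot botnet exploit analysis", ["c"]),
    "b": ("football match results", ["d"]),
    "c": ("router firmware vulnerability", []),
    "d": ("celebrity cooking recipes", []),
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def train(examples):
    """Collect per-class token counts for a naive Bayes relevance model."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        docs[label] += 1
    return counts, docs, set(counts[True]) | set(counts[False])

def log_odds(model, text):
    """Log-odds that a text is relevant (Laplace-smoothed); > 0 means relevant."""
    counts, docs, vocab = model
    totals = {c: sum(counts[c].values()) for c in (True, False)}
    score = math.log(docs[True] / docs[False])
    for tok in tokenize(text):
        if tok in vocab:  # tokens never seen in training are ignored
            p_pos = (counts[True][tok] + 1) / (totals[True] + len(vocab))
            p_neg = (counts[False][tok] + 1) / (totals[False] + len(vocab))
            score += math.log(p_pos / p_neg)
    return score

def focused_crawl(web, seed, model, limit=10):
    """Phase 1: visit pages best-first, pruning links found on off-topic pages."""
    frontier = [(0.0, seed)]  # min-heap of (negated priority, url)
    seen = {seed}
    harvested = []
    while frontier and limit > 0:
        _, url = heapq.heappop(frontier)
        limit -= 1
        text, links = web[url]
        score = log_odds(model, text)
        if score > 0:
            harvested.append(url)
        if score >= 0:  # follow links only from neutral or relevant pages
            for link in links:
                if link in web and link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score, link))
    return harvested

def embed(text):
    """Phase 2 stand-in: signed feature hashing into DIM dimensions."""
    vec = [0.0] * DIM
    for tok in tokenize(text):
        h = zlib.crc32(tok.encode("utf-8"))
        vec[h % DIM] += 1.0 if (h >> 16) & 1 else -1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(texts, query):
    """Order harvested texts by cosine similarity to a task description."""
    q = embed(query)
    return sorted(texts, key=lambda t: cosine(embed(t), q), reverse=True)
```

Calling `focused_crawl(WEB, "seed", train(TRAIN))` harvests only the pages the classifier scores as relevant, and `rank` can then order their texts against a query such as `"iot exploit"`. A real system would replace both stand-ins with trained models and fetch pages over the network.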

Related papers

Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents [49.88380945341337]
We decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework.
arXiv Detail & Related papers (2025-08-03T17:17:52Z)
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence [109.32705135051486]
Embodied Web Agents is a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. We release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks. Results reveal significant performance gaps between state-of-the-art AI systems and human capabilities.
arXiv Detail & Related papers (2025-06-18T17:58:17Z)
Combining Threat Intelligence with IoT Scanning to Predict Cyber Attack [0.0]
Malicious actors, including hackers and hacktivist groups, often disseminate ideological content and coordinate activities through the "Dark Web". This paper proposes a novel predictive threat intelligence framework designed to systematically collect, analyze, and visualize Dark Web data.
arXiv Detail & Related papers (2024-11-26T23:00:51Z)
CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity [49.657358248788945]
Textual descriptions in cyber threat intelligence (CTI) reports are rich sources of knowledge about cyber threats. Current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. We propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models.
arXiv Detail & Related papers (2024-10-28T14:18:32Z)
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs). In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
Networking Systems for Video Anomaly Detection: A Tutorial and Survey [55.28514053969056]
Video Anomaly Detection (VAD) is a fundamental research task within the Artificial Intelligence (AI) community. With the advancements in deep learning and edge computing, VAD has made significant progress. This article offers an exhaustive tutorial for novices in NSVAD.
arXiv Detail & Related papers (2024-05-16T02:00:44Z)
Informed Meta-Learning [55.2480439325792]
Meta-learning and informed ML stand out as two approaches for incorporating prior knowledge into ML pipelines. We formalise a hybrid paradigm, informed meta-learning, facilitating the incorporation of priors from unstructured knowledge representations. We demonstrate the potential benefits of informed meta-learning in improving data efficiency, robustness to observational noise and task distribution shifts.
arXiv Detail & Related papers (2024-02-25T15:08:37Z)
TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild [0.06597195879147556]
The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy. Previous research has focused on improving individual components of the extraction process. The community lacks open-source platforms for deploying streaming CTI data pipelines in the wild.
arXiv Detail & Related papers (2024-02-15T14:29:21Z)
A Responsive Framework for Research Portals Data using Semantic Web Technology [0.6798775532273751]
The research aims to address this issue by designing a framework for the semantic organization of research portal data. The framework focuses on the extraction of information from two specific research portals, namely Microsoft Academic and IEEE Xplore.
arXiv Detail & Related papers (2023-06-20T16:12:33Z)
Your Room is not Private: Gradient Inversion Attack on Reinforcement Learning [47.96266341738642]
Privacy emerges as a pivotal concern within the realm of embodied AI, as the robot accesses substantial personal information. This paper proposes an attack on the value-based algorithm and the gradient-based algorithm, utilizing gradient inversion to reconstruct states, actions, and supervision signals.
arXiv Detail & Related papers (2023-06-15T16:53:26Z)
Graph Mining for Cybersecurity: A Survey [61.505995908021525]
The explosive growth of cyber attacks, such as malware, spam, and intrusions, has had severe consequences for society. Traditional Machine Learning (ML) based methods are extensively used in detecting cyber threats, but they hardly model the correlations between real-world cyber entities. With the proliferation of graph mining techniques, many researchers have investigated these techniques for capturing correlations between cyber entities and achieving high performance.
arXiv Detail & Related papers (2023-04-02T08:43:03Z)
Recognizing and Extracting Cybersecurtity-relevant Entities from Text [1.7499351967216343]
Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks. CTI is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG).
arXiv Detail & Related papers (2022-08-02T18:44:06Z)
Knowledge mining of unstructured information: application to cyber-domain [0.0]
We present and implement a novel knowledge graph and knowledge mining framework for extracting relevant information from free-form text about incidents in the cyber domain. Our framework includes a machine learning based pipeline as well as crawling methods for generating graphs of entities, attackers and the related information. We test our framework on publicly available cyber incident datasets to evaluate the accuracy of our knowledge mining methods as well as the usefulness of the framework in the use of cyber analysts.
arXiv Detail & Related papers (2021-09-08T18:01:56Z)
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses [150.64470864162556]
This work systematically categorizes and discusses a wide range of dataset vulnerabilities and exploits. In addition to describing various poisoning and backdoor threat models and the relationships among them, we develop their unified taxonomy.
arXiv Detail & Related papers (2020-12-18T22:38:47Z)
Intrusion detection in computer systems by using artificial neural networks with Deep Learning approaches [0.0]
Intrusion detection into computer networks has become one of the most important issues in cybersecurity. This paper focuses on the design and implementation of an intrusion detection system based on Deep Learning architectures.
arXiv Detail & Related papers (2020-12-15T19:12:23Z)
A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service. It is able to preserve the user sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.