S-DAPT-2026: A Stage-Aware Synthetic Dataset for Advanced Persistent Threat Detection
- URL: http://arxiv.org/abs/2601.06690v1
- Date: Sat, 10 Jan 2026 21:25:41 GMT
- Title: S-DAPT-2026: A Stage-Aware Synthetic Dataset for Advanced Persistent Threat Detection
- Authors: Saleem Ishaq Tijjani, Bogdan Ghita, Nathan Clarke, Matthew Craven
- Abstract summary: This paper presents a near-realistic synthetic APT dataset and an efficient alert correlation framework. The proposed approach introduces a machine-learning-based correlation module that employs K Nearest Neighbors (KNN) clustering with a cosine similarity metric to group semantically related alerts. A comprehensive statistical characterization of the dataset is provided to facilitate reproducibility and support APT stage predictions.
- Score: 0.0538441598991272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The detection of advanced persistent threats (APTs) remains a crucial challenge due to their stealthy, multistage nature and the limited availability of realistic, labeled datasets for systematic evaluation. Synthetic dataset generation has emerged as a practical approach for modeling APT campaigns; however, existing methods often rely on computationally expensive alert correlation mechanisms that limit scalability. Motivated by these limitations, this paper presents a near realistic synthetic APT dataset and an efficient alert correlation framework. The proposed approach introduces a machine learning based correlation module that employs K Nearest Neighbors (KNN) clustering with a cosine similarity metric to group semantically related alerts within a temporal context. The dataset emulates multistage APT campaigns across campus and organizational network environments and captures a diverse set of fourteen distinct alert types, exceeding the coverage of commonly used synthetic APT datasets. In addition, explicit APT campaign states and alert to stage mappings are defined to enable flexible integration of new alert types and support stage aware analysis. A comprehensive statistical characterization of the dataset is provided to facilitate reproducibility and support APT stage predictions.
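The correlation step described in the abstract (nearest-neighbour grouping of alerts by cosine similarity within a temporal context) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Alert` record, its numeric `features` encoding, the parameters `k`, `window` and `min_sim`, and the use of connected components to turn neighbour links into groups are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Alert:
    """Hypothetical alert record; the paper's real schema may differ."""
    alert_id: str
    timestamp: float        # seconds since some reference point
    features: list          # assumed numeric encoding of the alert


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def correlate_alerts(alerts, k=2, window=3600.0, min_sim=0.8):
    """Link each alert to its k most similar neighbours that fall inside the
    time window and reach min_sim, then return the connected groups of ids."""
    parent = list(range(len(alerts)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(alerts):
        candidates = []
        for j, b in enumerate(alerts):
            if i == j or abs(a.timestamp - b.timestamp) > window:
                continue  # temporal context: ignore alerts too far apart
            sim = cosine_similarity(a.features, b.features)
            if sim >= min_sim:
                candidates.append((sim, j))
        candidates.sort(reverse=True)  # most similar first
        for _, j in candidates[:k]:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i, a in enumerate(alerts):
        groups.setdefault(find(i), []).append(a.alert_id)
    return sorted(groups.values())


if __name__ == "__main__":
    demo = [
        Alert("a1", 0.0, [1.0, 0.0, 0.0]),
        Alert("a2", 100.0, [0.9, 0.1, 0.0]),
        Alert("a3", 200.0, [0.0, 1.0, 0.0]),
        Alert("a4", 10000.0, [1.0, 0.0, 0.0]),
    ]
    print(correlate_alerts(demo))  # [['a1', 'a2'], ['a3'], ['a4']]
```

In the demo, `a4` has the same features as `a1` but lies outside the one-hour window, so it stays in its own group. The brute-force pairwise loop is quadratic in the number of alerts; a scalable version would need an indexed neighbour search, which is outside the scope of this sketch.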
Related papers
- Taipan: A Query-free Transfer-based Multiple Sensitive Attribute Inference Attack Solely from Publicly Released Graphs [4.838500914184325]
We introduce Taipan, the first query-free transfer-based attack framework for multiple sensitive attribute inference attacks on graphs. Experiments on diverse real-world graph datasets demonstrate that Taipan consistently achieves strong attack performance across same-distribution settings. Our findings underscore the urgent need for more robust multi-attribute privacy-preserving graph publishing methods and data-sharing practices.
arXiv Detail & Related papers (2026-02-06T13:37:24Z)
- APT-MCL: An Adaptive APT Detection System Based on Multi-View Collaborative Provenance Graph Learning [14.65353464010361]
Advanced persistent threats (APTs) are stealthy and multi-stage, making single-point defenses ill-suited to capture long-range and cross-entity attack semantics. This paper proposes APT-MCL, an intelligent APT detection system based on Multi-view Collaborative graph Learning.
arXiv Detail & Related papers (2026-01-13T08:30:43Z)
- Deep Recurrent Hidden Markov Learning Framework for Multi-Stage Advanced Persistent Threat Prediction [0.0538441598991272]
Advanced Persistent Threats (APTs) represent hidden, multistage cyberattacks whose long-term persistence and adaptive behavior challenge conventional intrusion detection systems (IDS). This paper proposes E-HiDNet, a unified hybrid deep probabilistic learning framework that integrates convolutional and recurrent neural networks with a Hidden Markov Model (HMM) to allow accurate prediction of the progression of the APT campaign. Simulation results show that E-HiDNet achieves up to 98.8-100% accuracy in stage prediction and significantly outperforms standalone HMMs when four or more observations are available.
arXiv Detail & Related papers (2026-01-11T01:01:10Z)
- APT-CGLP: Advanced Persistent Threat Hunting via Contrastive Graph-Language Pre-Training [33.84587345029278]
Provenance-based threat hunting identifies Advanced Persistent Threats (APTs) on endpoints by correlating attack patterns described in Cyber Threat Intelligence (CTI) with provenance graphs derived from system audit logs. A fundamental challenge in this paradigm lies in the modality gap, the structural and semantic disconnect between provenance graphs and CTI reports. We present APT-CGLP, a novel cross-modal APT hunting system via Contrastive Graph-Language Pre-training.
arXiv Detail & Related papers (2025-11-25T13:20:12Z)
- FedGPS: Statistical Rectification Against Data Heterogeneity in Federated Learning [103.45987800174724]
Federated Learning (FL) confronts a significant challenge known as data heterogeneity, which impairs model performance and convergence. We propose FedGPS, a novel framework that seamlessly integrates statistical distribution and gradient information from others.
arXiv Detail & Related papers (2025-10-23T06:10:11Z)
- Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z)
- Conditional Data Synthesis Augmentation [4.3108820946281945]
Conditional Data Synthesis Augmentation (CoDSA) is a novel framework that synthesizes high-fidelity data for improving model performance across multimodal domains. CoDSA fine-tunes pre-trained generative models to enhance the realism of synthetic data and increase sample density in sparse areas. We introduce a theoretical framework that quantifies the statistical accuracy improvements enabled by CoDSA as a function of synthetic sample volume and targeted region allocation.
arXiv Detail & Related papers (2025-04-10T03:38:11Z)
- CTI-HAL: A Human-Annotated Dataset for Cyber Threat Intelligence Analysis [2.7862108332002546]
Cyber Threat Intelligence (CTI) sources are often unstructured and in natural language, making it difficult to automatically extract information. Recent studies have explored the use of AI to perform automatic extraction from CTI data. We introduce a novel dataset manually constructed from CTI reports and structured according to the MITRE ATT&CK framework.
arXiv Detail & Related papers (2025-04-08T09:47:15Z)
- AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing [64.79967583649407]
Knowledge Tracing (KT) monitors students' knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to significant error accumulation. We propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT) which focuses on the multi-step KT task.
arXiv Detail & Related papers (2025-04-07T03:31:57Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
The second strategy leverages a novel re-ranking technique, which has a lower upper bound on time complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.