SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping
- URL: http://arxiv.org/abs/2507.16852v1
- Date: Mon, 21 Jul 2025 09:22:39 GMT
- Title: SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping
- Authors: Álvaro Ruiz-Ródenas, Jaime Pujante Sáez, Daniel García-Algora, Mario Rodríguez Béjar, Jorge Blasco, José Luis Hernández-Ramos
- Abstract summary: We present SynthCTI, a framework designed to generate high-quality synthetic CTI sentences for underrepresented MITRE ATT&CK techniques. Our method uses a clustering-based strategy to extract semantic context from training data. We evaluate SynthCTI on two publicly available CTI datasets, CTI-to-MITRE and TRAM, using LLMs of different capacities.
- Score: 1.2534672170380357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cyber Threat Intelligence (CTI) mining involves extracting structured insights from unstructured threat data, enabling organizations to understand and respond to evolving adversarial behavior. A key task in CTI mining is mapping threat descriptions to MITRE ATT&CK techniques. However, this process is often performed manually, requiring expert knowledge and substantial effort. Automated approaches face two major challenges: the scarcity of high-quality labeled CTI data and class imbalance, where many techniques have very few examples. While domain-specific Large Language Models (LLMs) such as SecureBERT have shown improved performance, most recent work focuses on model architecture rather than addressing the data limitations. In this work, we present SynthCTI, a data augmentation framework designed to generate high-quality synthetic CTI sentences for underrepresented MITRE ATT&CK techniques. Our method uses a clustering-based strategy to extract semantic context from training data and guide an LLM in producing synthetic CTI sentences that are lexically diverse and semantically faithful. We evaluate SynthCTI on two publicly available CTI datasets, CTI-to-MITRE and TRAM, using LLMs of different capacities. Incorporating synthetic data leads to consistent macro-F1 improvements: for example, ALBERT improves from 0.35 to 0.52 (a relative gain of 48.6%), and SecureBERT reaches 0.6558 (up from 0.4412). Notably, smaller models augmented with SynthCTI outperform larger models trained without augmentation, demonstrating the value of data generation methods for building efficient and effective CTI classification systems.
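A minimal sketch of the augmentation loop the abstract describes, assuming hypothetical `embed` and `call_llm` helpers and generic clustering settings; the actual SynthCTI prompts and hyperparameters are not reproduced here.

```python
# Hedged sketch of a SynthCTI-style augmentation step: cluster the few available
# sentences for an underrepresented ATT&CK technique, take one representative
# sentence per cluster as semantic context, and ask an LLM to write new,
# label-consistent sentences. `embed` and `call_llm` are hypothetical placeholders.
import numpy as np
from sklearn.cluster import KMeans


def augment_technique(technique_id, sentences, embed, call_llm,
                      n_clusters=3, n_synthetic=20):
    X = np.asarray(embed(sentences))                      # (n_sentences, dim) embeddings
    k = min(n_clusters, len(sentences))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Representative example = the sentence closest to each cluster centroid.
    reps = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        centroid = X[idx].mean(axis=0)
        reps.append(sentences[idx[np.argmin(np.linalg.norm(X[idx] - centroid, axis=1))]])

    prompt = (
        f"You are writing Cyber Threat Intelligence sentences describing "
        f"MITRE ATT&CK technique {technique_id}.\n"
        "Representative examples:\n" + "\n".join(f"- {r}" for r in reps) + "\n"
        f"Write {n_synthetic} new sentences that describe the same technique with "
        "different wording. Return one sentence per line."
    )
    return [line.strip("- ").strip() for line in call_llm(prompt).splitlines() if line.strip()]
```

The generated sentences would then be appended to the minority technique's training examples before fine-tuning a classifier such as ALBERT or SecureBERT.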
Related papers
- CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks [57.482238100217195]
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) before generating new synthetic prompts. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.
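A rough illustration of the CoT-first prompting pattern this summary describes; the wording below is a paraphrase of the idea, not the paper's actual template.

```python
# Illustrative CoT-first prompt: the model is asked to reason about the seed
# prompts before emitting a new synthetic prompt of similar quality.
# Paraphrased pattern only; not CoT-Self-Instruct's exact template.
def build_cot_self_instruct_prompt(seed_prompts):
    seeds = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(seed_prompts))
    return (
        "Here are example task prompts:\n"
        f"{seeds}\n\n"
        "First, reason step by step about what skills these prompts test, their "
        "difficulty, and their format.\n"
        "Then write ONE new prompt of similar quality and difficulty, prefixed "
        "with 'NEW PROMPT:'."
    )
```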
arXiv Detail & Related papers (2025-07-31T17:38:50Z)
- SMOTExT: SMOTE meets Large Language Models [19.394116388173885]
We propose a novel technique, SMOTExT, that adapts the idea of Synthetic Minority Over-sampling (SMOTE) to textual data. Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples. In early experiments, training models solely on generated data achieved comparable performance to models trained on the original dataset.
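A small sketch of the interpolation step described above; a generic sentence encoder stands in for the paper's BERT-based embeddings, and mapping the interpolated vectors back to text is omitted since it depends on SMOTExT's specific decoding setup.

```python
# SMOTE-style interpolation over text embeddings: new synthetic vectors are
# created on the segment between the embeddings of two minority-class examples.
# The encoder below is an assumed stand-in, not the paper's exact model.
import numpy as np
from sentence_transformers import SentenceTransformer


def smote_interpolate(minority_texts, n_new, seed=0):
    rng = np.random.default_rng(seed)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")     # assumed stand-in encoder
    X = encoder.encode(minority_texts)                    # (n_examples, dim)

    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(X), size=2, replace=False)  # pick two real examples
        lam = rng.uniform(0.0, 1.0)                       # SMOTE mixing coefficient
        synthetic.append(X[i] + lam * (X[j] - X[i]))      # point between the two embeddings
    return np.stack(synthetic)
```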
arXiv Detail & Related papers (2025-05-19T17:57:36Z)
- Towards Effective Identification of Attack Techniques in Cyber Threat Intelligence Reports using Large Language Models [5.304267859042463]
This work evaluates the performance of Cyber Threat Intelligence (CTI) extraction methods in identifying attack techniques from threat reports available on the web. We analyse four configurations utilising state-of-the-art tools, including the Threat Report ATT&CK Mapper (TRAM) and open-source Large Language Models (LLMs) such as Llama2. Our findings reveal significant challenges, including class imbalance, overfitting, and domain-specific complexity, which impede accurate technique extraction.
arXiv Detail & Related papers (2025-05-06T03:43:12Z)
- CTI-HAL: A Human-Annotated Dataset for Cyber Threat Intelligence Analysis [2.7862108332002546]
Cyber Threat Intelligence (CTI) sources are often unstructured and in natural language, making it difficult to automatically extract information. Recent studies have explored the use of AI to perform automatic extraction from CTI data. We introduce a novel dataset manually constructed from CTI reports and structured according to the MITRE ATT&CK framework.
arXiv Detail & Related papers (2025-04-08T09:47:15Z)
- AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing [64.79967583649407]
Knowledge Tracing (KT) monitors students' knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to significant error accumulation. We propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT) which focuses on the multi-step KT task.
arXiv Detail & Related papers (2025-04-07T03:31:57Z)
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models [49.657358248788945]
Textual descriptions in cyber threat intelligence (CTI) reports are rich sources of knowledge about cyber threats. Current CTI knowledge extraction methods lack flexibility and generalizability. We propose CTINexus, a novel framework for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction.
arXiv Detail & Related papers (2024-10-28T14:18:32Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Synthetic Network Traffic Data Generation: A Comparative Study [0.0]
Existing methods for synthetic data generation differ significantly in their ability to maintain statistical fidelity, utility for classification tasks, and class balance. This study presents a comparative analysis of twelve synthetic network traffic data generation methods, encompassing non-AI (statistical), classical AI, and generative AI techniques. Results demonstrate that GAN-based models, particularly CTGAN and CopulaGAN, achieve superior fidelity and utility, making them ideal for high-quality synthetic data generation.
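For a sense of what using one of the GAN-based generators mentioned above involves, here is a minimal sketch with the open-source `ctgan` package (its CTGAN fit/sample API is assumed); the file and column names are illustrative only.

```python
# Minimal sketch: fit a CTGAN generator to tabular flow records and sample
# synthetic rows. Uses the open-source `ctgan` package (fit/sample API assumed);
# "flows.csv" and the column names are hypothetical.
import pandas as pd
from ctgan import CTGAN

flows = pd.read_csv("flows.csv")              # hypothetical flow-record table
discrete_cols = ["protocol", "label"]         # categorical columns must be declared

model = CTGAN(epochs=300)
model.fit(flows, discrete_cols)

synthetic_flows = model.sample(10_000)        # draw synthetic records
print(synthetic_flows.head())
```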
arXiv Detail & Related papers (2024-10-18T14:19:25Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, which empowers LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
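A rough sketch of the self-correction pass described above, assuming a hypothetical `call_llm` helper; the prompt wording is a paraphrase, not TarGEN's actual template.

```python
# Self-correction sketch: after synthetic instances are generated, the LLM
# re-checks each label and replaces it if it looks wrong.
# `call_llm` is a hypothetical placeholder.
def self_correct(instances, call_llm):
    corrected = []
    for inst in instances:
        verdict = call_llm(
            f"Text: {inst['text']}\n"
            f"Proposed label: {inst['label']}\n"
            "Is this label correct? Answer 'yes' or give the corrected label."
        ).strip()
        if verdict.lower() != "yes":
            inst = {**inst, "label": verdict}   # keep the model's correction
        corrected.append(inst)
    return corrected
```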
arXiv Detail & Related papers (2023-10-27T03:32:17Z)