Related papers: AttackSeqBench: Benchmarking Large Language Models in Analyzing Attack Sequences within Cyber Threat Intelligence

AttackSeqBench: Benchmarking Large Language Models in Analyzing Attack Sequences within Cyber Threat Intelligence

URL: http://arxiv.org/abs/2503.03170v2
Date: Sun, 05 Oct 2025 10:27:08 GMT
Title: AttackSeqBench: Benchmarking Large Language Models in Analyzing Attack Sequences within Cyber Threat Intelligence
Authors: Haokai Ma, Javier Yong, Yunshan Ma, Kuei Chen, Anis Yusof, Zhenkai Liang, Ee-Chien Chang,
Abstract summary: Cyber Threat Intelligence (CTI) reports document observations of cyber threats, synthesizing evidence about adversaries' actions and intent into actionable knowledge.<n>The unstructured and verbose nature of CTI reports poses significant challenges for security practitioners to manually extract and analyze such sequences.<n>Although large language models (LLMs) exhibit promise in cybersecurity tasks such as entity extraction and knowledge graph construction, their understanding and reasoning capabilities towards behavioral sequences remains underexplored.
Score: 17.234214109636113
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cyber Threat Intelligence (CTI) reports document observations of cyber threats, synthesizing evidence about adversaries' actions and intent into actionable knowledge that informs detection, response, and defense planning. However, the unstructured and verbose nature of CTI reports poses significant challenges for security practitioners to manually extract and analyze such sequences. Although large language models (LLMs) exhibit promise in cybersecurity tasks such as entity extraction and knowledge graph construction, their understanding and reasoning capabilities towards behavioral sequences remains underexplored. To address this, we introduce AttackSeqBench, a benchmark designed to systematically evaluate LLMs' reasoning abilities across the tactical, technical, and procedural dimensions of adversarial behaviors, while satisfying Extensibility, Reasoning Scalability, and Domain-dpecific Epistemic Expandability. We further benchmark 7 LLMs, 5 LRMs and 4 post-training strategies across the proposed 3 benchmark settings and 3 benchmark tasks within our AttackSeqBench to identify their advantages and limitations in such specific domain. Our findings contribute to a deeper understanding of LLM-driven CTI report understanding and foster its application in cybersecurity operations.

Related papers

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5 [61.787178868669265]
This technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication.<n>This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
arXiv Detail & Related papers (2026-02-16T04:30:06Z)
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence.<n>CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities.<n>Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z)
CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence [48.63397742510097]
Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats.<n>With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI.<n>We present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI.
arXiv Detail & Related papers (2025-10-13T22:10:17Z)
POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment [13.18964488705143]
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats.<n>In this paper, we investigate the intrinsic vulnerabilities of LLMs in cyber threat intelligence (CTI)<n>We introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-02T00:49:20Z)
Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence [15.881854286231997]
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats.<n>In this paper, we investigate the intrinsic vulnerabilities of LLMs in cyber threat intelligence (CTI)<n>We introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision.
arXiv Detail & Related papers (2025-09-28T02:08:27Z)
Advancing Autonomous Incident Response: Leveraging LLMs and Cyber Threat Intelligence [3.2284427438223013]
Security teams are overwhelmed by alert fatigue, high false-positive rates, and the vast volume of unstructured Cyber Threat Intelligence (CTI) documents.<n>We introduce a novel Retrieval-Augmented Generation (RAG)-based framework that leverages Large Language Models (LLMs) to automate and enhance IR.<n>Our approach introduces a hybrid retrieval mechanism that combines NLP-based similarity searches within a CTI vector database with standardized queries to external CTI platforms.
arXiv Detail & Related papers (2025-08-14T14:20:34Z)
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training.<n>VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion.<n>This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
A Survey on Model Extraction Attacks and Defenses for Large Language Models [55.60375624503877]
Model extraction attacks pose significant security threats to deployed language models.<n>This survey provides a comprehensive taxonomy of extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks.<n>We examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios.
arXiv Detail & Related papers (2025-06-26T22:02:01Z)
Lazarus Group Targets Crypto-Wallets and Financial Data while employing new Tradecrafts [0.0]
This report presents a comprehensive analysis of a malicious software sample, detailing its architecture, behavioral characteristics, and underlying intent.<n>The malware core functionalities, including persistence mechanisms, command-and-control communication, and data exfiltration routines, are identified.<n>This malware analysis report not only reconstructs past adversary actions but also establishes a robust foundation for anticipating and mitigating future attacks.
arXiv Detail & Related papers (2025-05-27T20:13:29Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.<n>We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis [47.34614558636679]
This study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents.<n>We identify three critical factors that amplify the vulnerability of Web AI agents; (1) embedding user goals into the system prompt, (2) multi-step action generation, and (3) observational capabilities.
arXiv Detail & Related papers (2025-02-27T18:56:26Z)
OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities [0.0]
We demonstrate a new approach to assessing AI's progress towards enabling and scaling real-world offensive cyber operations. We detail OCCULT, a lightweight operational evaluation framework that allows cyber security experts to contribute to rigorous and repeatable measurement. We find that there has been significant recent advancement in the risks of AI being used to scale realistic cyber threats.
arXiv Detail & Related papers (2025-02-18T19:33:14Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.<n>Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.<n>We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights [12.424610893030353]
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection. This paper provides a detailed survey of LLMs in vulnerability detection. We address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis.
arXiv Detail & Related papers (2025-02-10T21:33:38Z)
CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity [49.657358248788945]
Textual descriptions in cyber threat intelligence (CTI) reports are rich sources of knowledge about cyber threats. Current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. We propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models.
arXiv Detail & Related papers (2024-10-28T14:18:32Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation. Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z)
CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization [14.287652216484863]
Cyber Threat Intelligence (CTI) summarization involves generating concise and accurate highlights from web intelligence data.<n>We introduce CTISum, a new benchmark dataset designed for the CTI summarization task.<n>We also propose a novel fine-grained subtask: attack process summarization, which aims to help defenders assess risks, identify security gaps, and uncover vulnerabilities.
arXiv Detail & Related papers (2024-08-13T02:25:16Z)
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence [0.7499722271664147]
CTIBench is a benchmark designed to assess Large Language Models' performance in CTI applications. Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts.
arXiv Detail & Related papers (2024-06-11T16:42:02Z)
Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models [8.167614500821223]
We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul) with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset.
arXiv Detail & Related papers (2024-06-09T19:18:05Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
Data Poisoning for In-context Learning [49.77204165250528]
In-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks. This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks. We introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL.
arXiv Detail & Related papers (2024-02-03T14:20:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.