AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
- URL: http://arxiv.org/abs/2511.01144v1
- Date: Mon, 03 Nov 2025 01:45:29 GMT
- Title: AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
- Authors: Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth
- Abstract summary: Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. We extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families.
- Score: 4.077787659104315
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.
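Of the pipeline improvements the abstract names, duplicate removal is the most mechanical. The paper does not describe its method at this level of detail, so the following is only a minimal sketch of what such a step could look like, assuming normalization followed by exact-match hashing of benchmark questions:

```python
# Minimal near-duplicate filter for benchmark questions. Illustrative only:
# AthenaBench's actual pipeline is not specified at this level of detail.
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before hashing."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def dedupe(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized question."""
    seen: set[str] = set()
    unique: list[str] = []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


if __name__ == "__main__":
    sample = [
        "Which threat actor is known for spearphishing?",
        "Which threat actor is known for  spearphishing?!",  # near-duplicate
        "What mitigation applies to CVE-2024-0001?",
    ]
    print(dedupe(sample))  # the second question collapses onto the first
```

Real dedup stages often replace the exact hash with fuzzy matching (e.g., MinHash or embedding similarity) to also catch paraphrased duplicates.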
Related papers
- CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence [48.63397742510097]
Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI. We present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI.
arXiv Detail & Related papers (2025-10-13T22:10:17Z)
- POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment [13.18964488705143]
arXiv Detail & Related papers (2025-10-02T00:49:20Z)
- Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence [15.881854286231997]
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats. In this paper, we investigate the intrinsic vulnerabilities of LLMs in cyber threat intelligence (CTI). We introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision.
arXiv Detail & Related papers (2025-09-28T02:08:27Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on fewer than 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- Large Language Models are Unreliable for Cyber Threat Intelligence [12.091163364089782]
Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports. We show how LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident.
arXiv Detail & Related papers (2025-03-29T18:09:36Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning (a minimal search sketch in this spirit appears after this list).
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence [0.7499722271664147]
CTIBench is a benchmark designed to assess Large Language Models' performance in CTI applications.
Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts.
arXiv Detail & Related papers (2024-06-11T16:42:02Z)
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvement of Large Language Models. It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop (a generic MCTS skeleton appears after this list). Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the common assumption that base LLMs cannot effectively follow malicious instructions.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
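Two of the entries above describe search procedures concretely enough that a short sketch may help. For the Q* paper, which frames decoding as deliberative planning, one plausible reading is best-first search over partial reasoning traces, expanding whichever trace has the highest accumulated utility plus a heuristic estimate of future reward. The sketch below assumes that framing; `propose_steps` and `heuristic` are hypothetical stand-ins for the learned components the paper would supply:

```python
# A*-style deliberative search over partial reasoning traces, in the spirit of
# the Q* entry above. Hedged sketch: the paper's step proposal and Q-value
# heuristic are learned components; here they are caller-supplied callables.
import heapq
from typing import Callable


def deliberative_search(
    initial_state: str,
    propose_steps: Callable[[str], list[str]],  # e.g., LLM samples next steps
    heuristic: Callable[[str], float],          # estimated future reward (h)
    is_terminal: Callable[[str], bool],
    max_expansions: int = 100,
) -> str:
    """Repeatedly expand the partial trace with the highest g + h."""
    state = initial_state
    counter = 0  # tie-breaker so the heap never compares state strings
    # heapq is a min-heap, so priorities are negated to pop the best f = g + h.
    frontier = [(-heuristic(initial_state), counter, 0.0, initial_state)]
    while frontier and max_expansions > 0:
        _, _, g, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        max_expansions -= 1
        for step in propose_steps(state):
            child = state + "\n" + step
            child_g = g + 1.0  # placeholder per-step utility
            counter += 1
            heapq.heappush(
                frontier,
                (-(child_g + heuristic(child)), counter, child_g, child),
            )
    return state  # best trace found if no terminal state was reached
```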
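For the AlphaLLM entry, the abstract names Monte Carlo Tree Search as the core of its self-improving loop. The skeleton below is a generic UCT loop, not the paper's implementation; `expand_fn` and `reward_fn` are hypothetical placeholders for LLM generation and critic scoring:

```python
# Generic MCTS (UCT) skeleton of the kind the AlphaLLM entry refers to.
# Hedged sketch only; the paper's actual components are more involved.
import math
import random
from typing import Callable, Optional


class Node:
    def __init__(self, state: str, parent: Optional["Node"] = None):
        self.state, self.parent = state, parent
        self.children: list["Node"] = []
        self.visits, self.value = 0, 0.0

    def uct(self, c: float = 1.4) -> float:
        """Upper-confidence score balancing exploitation and exploration."""
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )


def mcts(
    root_state: str,
    expand_fn: Callable[[str], list[str]],  # e.g., LLM proposes continuations
    reward_fn: Callable[[str], float],      # e.g., a critic scores the result
    iterations: int = 50,
) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # Expansion: grow the tree with proposed continuations.
        for child_state in expand_fn(node.state):
            node.children.append(Node(child_state, parent=node))
        # Simulation: score one child (or the leaf itself if none were added).
        leaf = random.choice(node.children) if node.children else node
        reward = reward_fn(leaf.state)
        # Backpropagation: update statistics along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.state
```

In a self-improvement setting, the traces returned by such a search would be fed back as training data, closing the loop the abstract describes.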
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.