Large Language Models are Unreliable for Cyber Threat Intelligence
- URL: http://arxiv.org/abs/2503.23175v1
- Date: Sat, 29 Mar 2025 18:09:36 GMT
- Title: Large Language Models are Unreliable for Cyber Threat Intelligence
- Authors: Emanuele Mezzi, Fabio Massacci, Katja Tuma
- Abstract summary: Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports. We show how LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident.
- Score: 6.218417024312705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that, besides testing LLMs on CTI tasks under zero-shot learning, few-shot learning, and fine-tuning, also quantifies their consistency and confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports, and present new evidence of potential security risks in relying on LLMs for CTI. We show that LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, casting doubt on the feasibility of using LLMs in CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
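A minimal sketch of how the consistency and overconfidence checks described in the abstract could be reproduced, assuming a hypothetical `ask_llm` wrapper that returns a label and a self-reported confidence for a CTI classification question; the function name and the scoring below are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter

def ask_llm(report_text: str, question: str) -> tuple[str, float]:
    """Hypothetical wrapper around an LLM API call (assumption, not the paper's code).

    Returns (predicted_label, self_reported_confidence in [0, 1]).
    Replace with a real client call to the model under evaluation.
    """
    raise NotImplementedError

def evaluate_consistency(report_text: str, question: str, n_runs: int = 5):
    """Query the same prompt several times and measure agreement.

    Consistency   = fraction of runs that agree with the majority answer.
    Overconfidence = mean self-reported confidence minus consistency,
    a rough proxy for the calibration gap the paper investigates.
    """
    answers, confidences = [], []
    for _ in range(n_runs):
        label, confidence = ask_llm(report_text, question)
        answers.append(label)
        confidences.append(confidence)

    majority_label, majority_count = Counter(answers).most_common(1)[0]
    consistency = majority_count / n_runs
    overconfidence = sum(confidences) / n_runs - consistency
    return majority_label, consistency, overconfidence
```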
Related papers
- Multi-Task Learning with LLMs for Implicit Sentiment Analysis: Data-level and Task-level Automatic Weight Learning [18.836998294161834]
Implicit sentiment analysis presents significant challenges due to the absence of salient cue words.
We introduce MT-ISA, a novel MTL framework that enhances ISA by leveraging the generation and reasoning capabilities of large language models.
We introduce data-level and task-level automatic weight learning (AWL), which dynamically identifies relationships and prioritizes more reliable data and critical tasks.
arXiv Detail & Related papers (2024-12-12T08:15:16Z) - The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums [0.0]
Large language models (LLMs) can be used to analyze cyber threat intelligence (CTI) data from cybercrime forums.
This study assesses the performance of an LLM system built on the OpenAI GPT-3.5-turbo model [8] to extract CTI information.
arXiv Detail & Related papers (2024-08-06T09:15:25Z) - Can LLMs be Fooled? Investigating Vulnerabilities in LLMs [4.927763944523323]
The advent of Large Language Models (LLMs) has garnered significant popularity and wielded immense power across various domains within Natural Language Processing (NLP).
This paper will synthesize the findings from each vulnerability section and propose new directions of research and development.
By understanding the focal points of current vulnerabilities, we can better anticipate and mitigate future risks.
arXiv Detail & Related papers (2024-07-30T04:08:00Z) - How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs). Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
arXiv Detail & Related papers (2024-07-18T15:20:18Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence [0.7499722271664147]
CTIBench is a benchmark designed to assess Large Language Models' performance in CTI applications.
Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts.
arXiv Detail & Related papers (2024-06-11T16:42:02Z) - Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models [8.167614500821223]
We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction.
Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul) with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset.
arXiv Detail & Related papers (2024-06-09T19:18:05Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations (a minimal sketch of such an estimator appears after this list).
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
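As an illustration of the local intrinsic dimension idea mentioned in the truthfulness entry above, here is a minimal sketch of the maximum-likelihood (Levina-Bickel) LID estimator applied to a matrix of model activations; the choice of estimator and the neighbourhood size `k` are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def local_intrinsic_dimension(activations: np.ndarray, k: int = 20) -> np.ndarray:
    """Maximum-likelihood (Levina-Bickel) LID estimate per point.

    activations: (n_points, n_features) array of model activations,
    with n_points > k. Returns a per-point LID estimate; lower values
    mean the activation lies on a lower-dimensional local manifold.
    """
    # Pairwise Euclidean distances between activation vectors.
    diffs = activations[:, None, :] - activations[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-distances

    # Distances to the k nearest neighbours, sorted ascending.
    knn = np.sort(dists, axis=1)[:, :k]

    # MLE estimator: (k - 1) divided by the sum of log-ratios of the
    # k-th neighbour distance to each closer neighbour distance.
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])
    return (k - 1) / np.sum(log_ratios, axis=1)
```

In the paper's setting such estimates would be computed over hidden-state activations collected while the model generates an answer; the sketch only shows the dimension estimate itself, not how it is mapped to a truthfulness prediction.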