CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
- URL: http://arxiv.org/abs/2402.07688v2
- Date: Mon, 3 Jun 2024 08:14:45 GMT
- Title: CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
- Authors: Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Tamas Bisztray, Merouane Debbah
- Abstract summary: Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence.
To accurately test the general knowledge of LLMs in cybersecurity, the research community needs a diverse, accurate, and up-to-date dataset.
We present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets.
- Score: 2.0893807243791636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different fields of cybersecurity, which include topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. To accurately test the general knowledge of LLMs in cybersecurity, the research community needs a diverse, accurate, and up-to-date dataset. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2,000, and 10,000 questions, respectively. Utilizing GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents in the cybersecurity domain, including NIST standards, research papers, publicly accessible books, RFCs, and other publications, and used them to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance, and to filter out any questions unrelated to cybersecurity. We evaluated and compared 25 state-of-the-art LLMs on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2, and Gemma-7b.
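To give a concrete sense of how a multiple-choice benchmark like this is consumed, the sketch below shows a minimal closed-book evaluation harness for CyberMetric-style questions. The JSON layout (`questions`, `answers`, `solution` fields), the file name `CyberMetric-80.json`, the `ask_model` callable, and the regex-based letter extraction are all assumptions for illustration; the released dataset files and the authors' own evaluation scripts may differ.

```python
import json
import re
from typing import Callable


def load_questions(path: str) -> list[dict]:
    """Load a CyberMetric-style JSON file.

    Assumed layout (hypothetical, check the released files):
    {"questions": [{"question": "...",
                    "answers": {"A": "...", "B": "...", "C": "...", "D": "..."},
                    "solution": "B"}, ...]}
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)["questions"]


def format_prompt(item: dict) -> str:
    # Render the question and its four options, then ask for a single letter.
    options = "\n".join(f"{k}) {v}" for k, v in sorted(item["answers"].items()))
    return (
        "Answer the following multiple-choice question with a single letter "
        "(A, B, C, or D).\n\n"
        f"{item['question']}\n{options}\nAnswer:"
    )


def extract_choice(reply: str) -> str | None:
    """Pull the first standalone A-D letter out of the model's free-form reply."""
    match = re.search(r"\b([ABCD])\b", reply.upper())
    return match.group(1) if match else None


def evaluate(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Closed-book accuracy: fraction of questions answered with the correct letter."""
    correct = 0
    for item in items:
        choice = extract_choice(ask_model(format_prompt(item)))
        correct += int(choice == item["solution"])
    return correct / len(items)


if __name__ == "__main__":
    # Dummy model that always answers "A", just to exercise the harness.
    items = load_questions("CyberMetric-80.json")
    accuracy = evaluate(items, lambda prompt: "A")
    print(f"Accuracy: {accuracy:.2%}")
```

In practice, answer extraction is often handled by constraining decoding to the four option letters or by comparing option log-likelihoods rather than regex-matching free-form replies; the paper's own setup may use a different scheme.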
Related papers
- Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report [50.268821168513654]
We present Foundation-Sec-8B, a cybersecurity-focused large language model (LLM) built on the Llama 3.1 architecture.
We evaluate it across both established and new cybersecurity benchmarks, showing that it matches Llama 3.1-70B and GPT-4o-mini in certain cybersecurity-specific tasks.
By releasing our model to the public, we aim to accelerate progress and adoption of AI-driven tools in both public and private cybersecurity contexts.
arXiv Detail & Related papers (2025-04-28T08:41:12Z)
- The Digital Cybersecurity Expert: How Far Have We Come? [49.89857422097055]
We develop CSEBenchmark, a fine-grained cybersecurity evaluation framework based on 345 knowledge points expected of cybersecurity experts.
We evaluate 12 popular large language models (LLMs) on CSEBenchmark and find that even the best-performing model achieves only 85.42% overall accuracy.
By identifying and addressing specific knowledge gaps in each LLM, we achieve up to an 84% improvement in correcting previously incorrect predictions.
arXiv Detail & Related papers (2025-04-16T05:36:28Z)
- MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? [64.62421656031128]
MLRC-Bench is a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions.
Unlike prior work, MLRC-Bench measures the key steps of proposing and implementing novel research methods.
Even the best-performing tested agent closes only 9.3% of the gap between baseline and top human participant scores.
arXiv Detail & Related papers (2025-04-13T19:35:43Z)
- CyberLLMInstruct: A New Dataset for Analysing Safety of Fine-Tuned LLMs Using Cyber Security Data [2.2530496464901106]
The integration of large language models into cyber security applications presents significant opportunities.
CyberLLMInstruct is a dataset of 54,928 instruction-response pairs spanning cyber security tasks.
Fine-tuned models can achieve up to 92.50 percent accuracy on the CyberMetric benchmark.
arXiv Detail & Related papers (2025-03-12T12:29:27Z)
- Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training [1.5029560229270191]
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine.
We present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation.
Continual pre-training on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain in security-certification performance.
arXiv Detail & Related papers (2025-02-16T16:34:49Z)
- Humanity's Last Exam [434.8511341499966]
Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge.
It consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.
arXiv Detail & Related papers (2025-01-24T05:27:46Z)
- SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity [23.32017147274093]
SecBench is a benchmarking dataset designed to evaluate Large Language Models (LLMs) in the cybersecurity domain.
The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest.
Benchmarking results on 16 SOTA LLMs demonstrate the usability of SecBench.
arXiv Detail & Related papers (2024-12-30T08:11:54Z)
- CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity [25.07282324266835]
CS-Eval is a benchmark for large language models (LLMs) in cybersecurity.
It synthesizes research hotspots from academia and practical applications from industry.
It organizes high-quality questions into three cognitive levels: knowledge, ability, and application.
arXiv Detail & Related papers (2024-11-25T09:54:42Z)
- LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly.
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments.
Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
- CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions [0.2999888908665658]
Large Language Models (LLMs) have significantly advanced natural language processing (NLP), providing versatile capabilities across various applications.
However, their application to complex, domain-specific tasks, such as cyber-security, often faces substantial challenges.
In this study, we introduce SecKnowledge and CyberPal.AI to address these challenges and train security-expert LLMs.
arXiv Detail & Related papers (2024-08-17T22:37:39Z)
- The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums [0.0]
Large language models (LLMs) can be used to analyze cyber threat intelligence (CTI) data from cybercrime forums.
This study assesses the performance of an LLM system built on the OpenAI GPT-3.5-turbo model [8] to extract CTI information.
arXiv Detail & Related papers (2024-08-06T09:15:25Z)
- Large Language Models for Cyber Security: A Systematic Literature Review [14.924782327303765]
We conduct a comprehensive review of the literature on the application of Large Language Models in cybersecurity (LLM4Security).
We observe that LLMs are being applied to a wide range of cybersecurity tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection.
We identify several promising techniques for adapting LLMs to specific cybersecurity domains, such as fine-tuning, transfer learning, and domain-specific pre-training.
arXiv Detail & Related papers (2024-05-08T02:09:17Z)
- When LLMs Meet Cybersecurity: A Systematic Literature Review [9.347716970758604]
Large language models (LLMs) have opened new avenues across various fields, including cybersecurity.
However, a comprehensive overview of this research area has been lacking.
This study aims to shed light on the extensive potential of LLMs in enhancing cybersecurity practices.
arXiv Detail & Related papers (2024-05-06T17:07:28Z)
- Long-form factuality in large language models [60.07181269469043]
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics.
We benchmark a model's long-form factuality in open domains, using GPT-4 to generate LongFact.
We then propose that LLM agents can be used as automated evaluators of long-form factuality through a method we call Search-Augmented Factuality Evaluator (SAFE).
arXiv Detail & Related papers (2024-03-27T17:48:55Z)
- The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning [87.1610740406279]
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons.
Current evaluations are private, preventing further research into mitigating risk.
We publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions.
arXiv Detail & Related papers (2024-03-05T18:59:35Z)
- RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge [69.79676144482792]
This study aims to evaluate the ability of LLMs to distinguish reliable information from external knowledge.
Our benchmark consists of two tasks, Question Answering and Text Generation, and for each task, we provide models with a context containing counterfactual information.
arXiv Detail & Related papers (2023-11-14T13:24:19Z)
- A Survey on Detection of LLMs-Generated Content [97.87912800179531]
The ability to detect LLM-generated content has become of paramount importance.
We aim to provide a detailed overview of existing detection strategies and benchmarks.
We also posit the necessity for a multi-faceted approach to defend against various attacks.
arXiv Detail & Related papers (2023-10-24T09:10:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences.