SECURE: Benchmarking Large Language Models for Cybersecurity
- URL: http://arxiv.org/abs/2405.20441v4
- Date: Wed, 30 Oct 2024 14:29:37 GMT
- Title: SECURE: Benchmarking Large Language Models for Cybersecurity
- Authors: Dipkamal Bhusal, Md Tanvirul Alam, Le Nguyen, Ashim Mahara, Zachary Lightcap, Rodney Frazier, Romy Fieblinger, Grace Long Torales, Benjamin A. Blakely, Nidhi Rastogi,
- Abstract summary: Large Language Models (LLMs) have demonstrated potential in cybersecurity applications but have also caused lower confidence due to problems like hallucinations and a lack of truthfulness.
Our study evaluates seven state-of-the-art models on these tasks, providing insights into their strengths and weaknesses in cybersecurity contexts.
- Score: 0.6741087029030101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated potential in cybersecurity applications but have also caused lower confidence due to problems like hallucinations and a lack of truthfulness. Existing benchmarks provide general evaluations but do not sufficiently address the practical and applied aspects of LLM performance in cybersecurity-specific tasks. To address this gap, we introduce the SECURE (Security Extraction, Understanding \& Reasoning Evaluation), a benchmark designed to assess LLMs performance in realistic cybersecurity scenarios. SECURE includes six datasets focussed on the Industrial Control System sector to evaluate knowledge extraction, understanding, and reasoning based on industry-standard sources. Our study evaluates seven state-of-the-art models on these tasks, providing insights into their strengths and weaknesses in cybersecurity contexts, and offer recommendations for improving LLMs reliability as cyber advisory tools.
Related papers
- The Digital Cybersecurity Expert: How Far Have We Come? [49.89857422097055]
We develop CSEBenchmark, a fine-grained cybersecurity evaluation framework based on 345 knowledge points expected of cybersecurity experts.
We evaluate 12 popular large language models (LLMs) on CSEBenchmark and find that even the best-performing model achieves only 85.42% overall accuracy.
By identifying and addressing specific knowledge gaps in each LLM, we achieve up to an 84% improvement in correcting previously incorrect predictions.
arXiv Detail & Related papers (2025-04-16T05:36:28Z) - A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations [127.52707312573791]
This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods.
We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs.
We conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results.
arXiv Detail & Related papers (2025-02-14T08:42:43Z) - LLM Cyber Evaluations Don't Capture Real-World Risk [0.0]
Large language models (LLMs) are demonstrating increasing prowess in cybersecurity applications.
We argue that current efforts to evaluate risks posed by these capabilities are misaligned with the goal of understanding real-world impact.
arXiv Detail & Related papers (2025-01-31T05:33:48Z) - ChatNVD: Advancing Cybersecurity Vulnerability Assessment with Large Language Models [0.46873264197900916]
This paper explores the potential application of Large Language Models (LLMs) to enhance the assessment of software vulnerabilities.
We develop three variants of ChatNVD, utilizing three prominent LLMs: GPT-4o mini by OpenAI, Llama 3 by Meta, and Gemini 1.5 Pro by Google.
To evaluate their efficacy, we conduct a comparative analysis of these models using a comprehensive questionnaire comprising common security vulnerability questions.
arXiv Detail & Related papers (2024-12-06T03:45:49Z) - Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [94.13848736705575]
We introduce Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms.
We apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels.
Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance.
arXiv Detail & Related papers (2024-11-05T23:26:10Z) - SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs [75.85283891591678]
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges.
Large language models (LLMs) increasingly assist in tasks ranging from procedural guidance to autonomous experiment orchestration.
Such overreliance is especially hazardous in high-stakes laboratory settings, where failures in hazard identification or risk assessment can result in severe accidents.
We propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive framework that evaluates LLMs and vision language models (VLMs) on their ability to identify potential hazards, assess risks, and predict the consequences of unsafe actions in lab environments.
arXiv Detail & Related papers (2024-10-18T05:21:05Z) - Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection [9.652886240532741]
This paper thoroughly analyses large language models' capabilities in detecting vulnerabilities within source code.
We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs.
arXiv Detail & Related papers (2024-08-29T10:00:57Z) - CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence [0.7499722271664147]
CTIBench is a benchmark designed to assess Large Language Models' performance in CTI applications.
Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts.
arXiv Detail & Related papers (2024-06-11T16:42:02Z) - Ollabench: Evaluating LLMs' Reasoning for Human-centric Interdependent Cybersecurity [0.0]
Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems.
Existing evaluation frameworks often overlook the human factor and cognitive computing capabilities essential for interdependent cybersecurity.
I propose OllaBench, a novel evaluation framework that assesses LLMs' accuracy, wastefulness, and consistency in answering scenario-based information security compliance and non-compliance questions.
arXiv Detail & Related papers (2024-06-11T00:35:39Z) - Generative AI and Large Language Models for Cyber Security: All Insights You Need [0.06597195879147556]
This paper provides a comprehensive review of the future of cybersecurity through Generative AI and Large Language Models (LLMs)
We explore LLM applications across various domains, including hardware design security, intrusion detection, software engineering, design verification, cyber threat intelligence, malware detection, and phishing detection.
We present an overview of LLM evolution and its current state, focusing on advancements in models such as GPT-4, GPT-3.5, Mixtral-8x7B, BERT, Falcon2, and LLaMA.
arXiv Detail & Related papers (2024-05-21T13:02:27Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Data Poisoning for In-context Learning [49.77204165250528]
In-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks.
This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks.
We introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL.
arXiv Detail & Related papers (2024-02-03T14:20:20Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM
Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z) - Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models [41.068780235482514]
This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of Large Language Models (LLMs) employed as coding assistants.
CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their level of compliance when asked to assist in cyberattacks.
arXiv Detail & Related papers (2023-12-07T22:07:54Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.