Related papers: CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

URL: http://arxiv.org/abs/2404.13161v1
Date: Fri, 19 Apr 2024 20:11:12 GMT
Title: CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models
Authors: Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, Joshua Saxe,
Abstract summary: Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
Score: 6.931433424951554
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our results show that conditioning away risk of attack remains an unsolved problem; for example, all tested models showed between 26% and 41% successful prompt injection tests. We further introduce the safety-utility tradeoff: conditioning an LLM to reject unsafe prompts can cause the LLM to falsely reject answering benign prompts, which lowers utility. We propose quantifying this tradeoff using False Refusal Rate (FRR). As an illustration, we introduce a novel test set to quantify FRR for cyberattack helpfulness risk. We find many LLMs able to successfully comply with "borderline" benign requests while still rejecting most unsafe requests. Finally, we quantify the utility of LLMs for automating a core cybersecurity task, that of exploiting software vulnerabilities. This is important because the offensive capabilities of LLMs are of intense interest; we quantify this by creating novel test sets for four representative problems. We find that models with coding capabilities perform better than those without, but that further work is needed for LLMs to become proficient at exploit generation. Our code is open source and can be used to evaluate other LLMs.

Related papers

Automated Consistency Analysis of LLMs [0.1747820331822631]
Generative AI with large language models (LLMs) are being widely adopted across the industry, academia and government. One of the key challenge to the trustworthiness and reliability of LLMs is: how consistent an LLM is in its responses. This paper proposes two approaches to validate consistency: self-validation, and validation across multiple LLMs.
arXiv Detail & Related papers (2025-02-10T21:03:24Z)
ASTRAL: Automated Safety Testing of Large Language Models [6.1050306667733185]
Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. We present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs.
arXiv Detail & Related papers (2025-01-28T18:25:11Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored. We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse. We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants [14.947665219536708]
We introduce the Malicious Programming Prompt (MaPP) attack, in which an attacker adds a small amount of text to a prompt for a programming task. We show that our prompt strategy can cause an LLM to add vulnerabilities while continuing to write otherwise correct code.
arXiv Detail & Related papers (2024-07-12T22:30:35Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
A Chinese Dataset for Evaluating the Safeguards in Large Language Models [46.43476815725323]
Large language models (LLMs) can produce harmful responses. This paper introduces a dataset for the safety evaluation of Chinese LLMs. We then extend it to two other scenarios that can be used to better identify false negative and false positive examples.
arXiv Detail & Related papers (2024-02-19T14:56:18Z)
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
Large Language Models (LLMs) have revolutionized natural language understanding and generation. This paper explores the intersection of LLMs with security and privacy.
arXiv Detail & Related papers (2023-12-04T16:25:18Z)
Can LLMs Patch Security Issues? [1.3299507495084417]
Large Language Models (LLMs) have shown impressive proficiency in code generation. LLMs share a weakness with their human counterparts: producing code that inadvertently has security vulnerabilities. We propose Feedback-Driven Security Patching (FDSP), where LLMs automatically refine generated, vulnerable code.
arXiv Detail & Related papers (2023-11-13T08:54:37Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models. We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.