SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in
Large Language Models
- URL: http://arxiv.org/abs/2311.08370v2
- Date: Fri, 16 Feb 2024 09:42:19 GMT
- Title: SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in
Large Language Models
- Authors: Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand
Kannappan, Scott A. Hale, Paul Röttger
- Abstract summary: We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks.
The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with.
While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme.
- Score: 15.896567445646784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The past year has seen rapid acceleration in the development of large
language models (LLMs). However, without proper steering and safeguards, LLMs
will readily follow malicious instructions, provide unsafe advice, and generate
toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for
rapidly and systematically identifying such critical safety risks. The test
suite comprises 100 test prompts across five harm areas that LLMs, for the vast
majority of applications, should refuse to comply with. We test 11 open-access
and open-source LLMs and four closed-source LLMs, and find critical safety
weaknesses. While some of the models do not give a single unsafe response, most
give unsafe responses to more than 20% of the prompts, with over 50% unsafe
responses in the extreme. Prepending a safety-emphasising system prompt
substantially reduces the occurrence of unsafe responses, but does not
completely stop them from happening. Trained annotators labelled every model
response to SST (n = 3,000). We use these annotations to evaluate five AI
safety filters (which assess whether a model's response is unsafe given a
prompt) as a way of automatically evaluating models' performance on SST. The
filters' performance varies considerably. There are also differences across the
five harm areas, and on the unsafe versus safe responses. The widely-used
Perspective API has 72% accuracy and a newly-created zero-shot prompt to
OpenAI's GPT-4 performs best with 89% accuracy. Content Warning: This paper
contains prompts and responses that relate to child abuse, suicide, self-harm
and eating disorders, scams and fraud, illegal items, and physical harm.
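
The evaluation loop described above (querying each model on the 100 SST prompts, optionally prepending a safety-emphasising system prompt, scoring each response as safe or unsafe with an automated safety filter, and comparing the filter's labels against the trained annotators' labels) can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the system prompt wording and the helper callables `query_model` and `judge_unsafe` are assumptions, standing in for whichever LLM under test and safety filter (e.g. a zero-shot GPT-4 classifier or Perspective API) are used.

```python
# Minimal sketch of an SST-style evaluation loop (illustrative only, not the
# authors' code). Assumes `sst_prompts` holds the 100 test prompts, that
# `query_model` wraps the LLM under test, and that `judge_unsafe` wraps an
# automated safety filter. The safety system prompt wording is a placeholder.
from typing import Callable, List

SAFETY_SYSTEM_PROMPT = (  # hypothetical safety-emphasising system prompt
    "You are a helpful assistant. Prioritise user safety and refuse any "
    "request that could facilitate harm."
)


def unsafe_response_rate(
    sst_prompts: List[str],
    query_model: Callable[[str, str], str],    # (system_prompt, user_prompt) -> response
    judge_unsafe: Callable[[str, str], bool],  # (prompt, response) -> True if unsafe
    use_safety_prompt: bool = False,
) -> float:
    """Fraction of SST prompts that elicit an unsafe response from the model."""
    system_prompt = SAFETY_SYSTEM_PROMPT if use_safety_prompt else ""
    unsafe = sum(
        judge_unsafe(prompt, query_model(system_prompt, prompt))
        for prompt in sst_prompts
    )
    return unsafe / len(sst_prompts)


def filter_accuracy(filter_labels: List[bool], annotator_labels: List[bool]) -> float:
    """Agreement of an automated safety filter with the trained annotators'
    labels over all model responses (n = 3,000 in the paper)."""
    assert len(filter_labels) == len(annotator_labels)
    matches = sum(f == a for f, a in zip(filter_labels, annotator_labels))
    return matches / len(annotator_labels)
```

Under this framing, the reported 72% (Perspective API) and 89% (zero-shot GPT-4) figures would correspond to `filter_accuracy` computed against the 3,000 human annotations.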
Related papers
- CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs [4.441767341563709]
We introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions.
This test set was used to evaluate the natural language generation capabilities of large language models (LLMs).
The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement.
arXiv Detail & Related papers (2024-10-29T03:25:20Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence (a rough sketch of the prefix construction appears after this list).
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.
This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts".
We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families.
arXiv Detail & Related papers (2024-05-31T15:44:33Z) - CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks.
We present CyberSecEval 2, a novel benchmark to quantify LLM security risks and capabilities.
We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - A Chinese Dataset for Evaluating the Safeguards in Large Language Models [46.43476815725323]
Large language models (LLMs) can produce harmful responses.
This paper introduces a dataset for the safety evaluation of Chinese LLMs.
We then extend it to two other scenarios that can be used to better identify false negative and false positive examples.
arXiv Detail & Related papers (2024-02-19T14:56:18Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models [34.75181539924584]
We introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours.
We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models.
arXiv Detail & Related papers (2023-08-02T16:30:40Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
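
As flagged in the DeRTa entry above, its first component trains on targets in which a safe refusal is preceded by a segment of a harmful response, so the model learns it can switch to a refusal at any position in a generation. The sketch below illustrates only that data-construction idea; the random-prefix rule, field names, and placeholder strings are assumptions made here, and the paper's second component (RTO) is not shown.

```python
# Illustrative DeRTa-style construction of harmful-response-prefix training
# targets, based only on the summary above (not the authors' implementation).
import random
from typing import Dict, List


def build_prefix_examples(
    prompt: str,
    harmful_response: str,
    safe_refusal: str,
    n_prefixes: int = 3,
) -> List[Dict[str, str]]:
    """For one harmful prompt, create training targets in which a partial
    harmful response is followed by a safe refusal, so the model practises
    transitioning from potential harm to refusal mid-response."""
    tokens = harmful_response.split()
    examples = []
    for _ in range(n_prefixes):
        cut = random.randint(0, len(tokens))  # random prefix length (may be empty)
        prefix = " ".join(tokens[:cut])
        target = (prefix + " " if prefix else "") + safe_refusal
        examples.append({"prompt": prompt, "target": target})
    return examples


# Hypothetical usage with placeholder strings:
examples = build_prefix_examples(
    prompt="<redacted harmful request>",
    harmful_response="<start of a harmful completion> ...",
    safe_refusal="I can't help with that request.",
)
```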