XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- URL: http://arxiv.org/abs/2308.01263v3
- Date: Mon, 1 Apr 2024 11:50:35 GMT
- Title: XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- Authors: Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
- Abstract summary: We introduce a new test suite called XSTest to identify eXaggerated Safety behaviours.
We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models.
- Score: 34.75181539924584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
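As a rough illustration of how a suite like XSTest can be used, the sketch below computes a refusal rate over a set of prompts. The `generate` callable and the string-match refusal patterns are placeholder assumptions for illustration; they are a crude stand-in for the response annotation used in the paper, not its evaluation protocol.

```python
# Minimal sketch (not the paper's evaluation protocol): estimating refusal
# rates over XSTest-style prompts. `generate` is any callable mapping a prompt
# to a model response; the refusal patterns below are illustrative assumptions.
import re
from typing import Callable, List

REFUSAL_PATTERNS = [
    r"\bI'?m sorry\b",
    r"\bI can(?:'t|not) (?:help|assist|comply)\b",
    r"\bAs an AI\b",
]

def looks_like_refusal(response: str) -> bool:
    """Crude string-match refusal detector (a stand-in for response annotation)."""
    return any(re.search(p, response, re.IGNORECASE) for p in REFUSAL_PATTERNS)

def refusal_rate(prompts: List[str], generate: Callable[[str], str]) -> float:
    """Fraction of prompts that the model refuses."""
    return sum(looks_like_refusal(generate(p)) for p in prompts) / len(prompts)
```

A well-calibrated model should score close to 0 on the 250 safe prompts and close to 1 on the 200 unsafe contrast prompts; a high refusal rate on the safe set is the exaggerated safety behaviour XSTest is designed to surface.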
Related papers
- Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives.
We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios.
Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
arXiv Detail & Related papers (2024-06-17T17:48:13Z)
- ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [18.3621509910395]
We propose a novel Automatic Red-Teaming framework, ART, to evaluate the safety risks of text-to-image models.
With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models.
We also introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models.
arXiv Detail & Related papers (2024-05-24T07:44:27Z)
- Position: Towards Implicit Prompt For Text-To-Image Models [57.00716011456852]
This paper examines the current state of text-to-image (T2I) models with respect to implicit prompts.
We present a benchmark named ImplicitBench and investigate the performance and impacts of implicit prompts.
Experimental results show that T2I models can accurately create various target symbols indicated by implicit prompts.
arXiv Detail & Related papers (2024-03-04T15:21:51Z)
- GuardT2I: Defending Text-to-Image Models from Adversarial Prompts [16.317849859000074]
GuardT2I is a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts.
Our experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator.
arXiv Detail & Related papers (2024-03-03T09:04:34Z)
- On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that, in the representation space, safety prompts typically move input queries in a "higher-refusal" direction.
Inspired by these findings, we propose DRO, a method for safety prompt optimization.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move queries' representations along or opposite the refusal direction, depending on their harmfulness (a toy sketch of this refusal-direction idea appears after this list).
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors behind overkill by exploring how models handle and determine the safety of queries.
Our findings reveal shortcuts within models that lead to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs.
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z)
- SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models [15.896567445646784]
We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks.
The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with.
While some models do not give a single unsafe response, most respond unsafely to more than 20% of the prompts, rising to over 50% in the most extreme cases.
arXiv Detail & Related papers (2023-11-14T18:33:43Z)
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content.
Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of shadow alignment attack.
This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z)
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions [79.1824160877979]
We show that several popular instruction-tuned models are highly unsafe.
Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks.
arXiv Detail & Related papers (2023-09-14T17:23:37Z)
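To make the refusal-direction idea from the prompt-driven safeguarding (DRO) entry above more concrete, here is a toy sketch. Estimating the direction as a difference of class means and using a fixed steering coefficient `alpha` are assumptions for illustration, not the paper's actual procedure.

```python
# Toy sketch of the "refusal direction" intuition behind safety-prompt
# optimization (DRO). The difference-of-means estimate and the fixed alpha
# below are illustrative assumptions, not the method described in the paper.
import numpy as np

def refusal_direction(refused: np.ndarray, complied: np.ndarray) -> np.ndarray:
    """Estimate a unit vector pointing from complied-query representations
    toward refused-query representations (both arrays of shape [n, d])."""
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(query_repr: np.ndarray, direction: np.ndarray,
          harmful: bool, alpha: float = 1.0) -> np.ndarray:
    """Move a query representation along the refusal direction if the query is
    harmful, and opposite to it if the query is safe."""
    sign = 1.0 if harmful else -1.0
    return query_repr + sign * alpha * direction
```

In DRO itself the safety-prompt embeddings are the trainable objects that learn to produce this movement; the sketch only shows the underlying geometry of shifting representations along or against a refusal direction.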