SafeText: A Benchmark for Exploring Physical Safety in Language Models
- URL: http://arxiv.org/abs/2210.10045v1
- Date: Tue, 18 Oct 2022 17:59:31 GMT
- Title: SafeText: A Benchmark for Exploring Physical Safety in Language Models
- Authors: Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia Chilton, Desmond
Patton, Kathleen McKeown, William Yang Wang
- Abstract summary: We study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks.
We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice.
- Score: 62.810902375154136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding what constitutes safe text is an important issue in natural
language processing and can often prevent the deployment of models deemed
harmful and unsafe. One such type of safety that has been scarcely studied is
commonsense physical safety, i.e. text that is not explicitly violent and
requires additional commonsense knowledge to comprehend that it leads to
physical harm. We create the first benchmark dataset, SafeText, comprising
real-life scenarios with paired safe and physically unsafe pieces of advice. We
utilize SafeText to empirically study commonsense physical safety across
various models designed for text generation and commonsense reasoning tasks. We
find that state-of-the-art large language models are susceptible to the
generation of unsafe text and have difficulty rejecting unsafe advice. As a
result, we argue for further studies of safety and the assessment of
commonsense physical safety in models before release.
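Code illustration: the benchmark pairs each real-life scenario with safe and unsafe advice, so one simple way to probe a model is to compare the likelihood it assigns to each continuation. The sketch below is a minimal, hypothetical example of that idea using GPT-2 through the HuggingFace transformers API; the file name `safetext_pairs.json`, its record format, and the choice of model are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch: score paired safe/unsafe advice with a causal LM.
# Assumes a hypothetical JSON file of {"scenario": ..., "safe": ..., "unsafe": ...}
# records; this is NOT the SafeText authors' evaluation pipeline.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in; the paper evaluates much larger models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def advice_log_likelihood(scenario: str, advice: str) -> float:
    """Average log-likelihood of the advice tokens, conditioned on the scenario."""
    prompt_ids = tokenizer(scenario + " ", return_tensors="pt").input_ids
    advice_ids = tokenizer(advice, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, advice_ids], dim=1)

    # Mask the scenario tokens so the loss covers only the advice continuation.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over advice tokens
    return -loss.item()


with open("safetext_pairs.json") as f:  # hypothetical file name and format
    pairs = json.load(f)

prefers_safe = 0
for ex in pairs:
    safe_ll = advice_log_likelihood(ex["scenario"], ex["safe"])
    unsafe_ll = advice_log_likelihood(ex["scenario"], ex["unsafe"])
    prefers_safe += safe_ll > unsafe_ll

print(f"Model prefers the safe advice in {prefers_safe}/{len(pairs)} scenarios")
```

A model that systematically assigns higher likelihood to the unsafe advice would be consistent with the paper's finding that large language models have difficulty rejecting unsafe advice.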
Related papers
- Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
- SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models [5.6874111521946356]
Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms.
We propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy.
HarmEval is a novel benchmark for extensive safety evaluations.
arXiv Detail & Related papers (2024-06-18T05:03:23Z)
- Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives.
We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios.
Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
arXiv Detail & Related papers (2024-06-17T17:48:13Z)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
- Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs.
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z)
- All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)
- Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI [76.28956947107372]
Covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and is challenging to detect as harmful.
We propose FARM, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety.
Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, with an absolute improvement of 5.9% in safety classification accuracy.
arXiv Detail & Related papers (2022-12-19T17:51:47Z)
- Mitigating Covertly Unsafe Text within Natural Language Systems [55.26364166702625]
Uncontrolled systems may generate recommendations that lead to injury or life-threatening consequences.
In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text.
arXiv Detail & Related papers (2022-10-17T17:59:49Z)
- On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark [42.322782754346406]
We propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors that are unique to the human-bot dialogue setting.
We compile DiaSafety, a dataset of 6 unsafe categories with rich context-sensitive unsafe examples.
Experiments show that existing utterance-level safety guarding tools fail catastrophically on our dataset.
arXiv Detail & Related papers (2021-10-16T04:17:12Z)