The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
- URL: http://arxiv.org/abs/2401.00287v1
- Date: Sat, 30 Dec 2023 17:37:06 GMT
- Title: The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
- Authors: Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
- Abstract summary: Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
- Score: 56.174255970895466
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become a critical area of NLP research. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis of 'safety' and 'over-defensiveness.' With SODE, we study a variety of LLM defense strategies over multiple state-of-the-art LLMs, which reveals several interesting and important findings: (a) the widely popular 'self-checking' techniques do improve safety against unsafe inputs, but at the cost of extreme over-defensiveness on safe inputs; (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness; and (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. Overall, our work reveals numerous such critical findings that we believe will pave the way for further research on improving the safety of LLMs.
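To make two of these compared defenses concrete, the sketch below shows, in Python, what a 'self-checking' wrapper (finding (a)) and a safety-instruction-plus-exemplars prompt (finding (b)) might look like. The `generate` function, the refusal text, and the exemplar wording are illustrative assumptions, not taken from the paper or the SODE benchmark.

```python
# Minimal sketch of two prompting defenses discussed above. `generate` is a
# placeholder for any LLM completion call; its API is an assumption.

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (plug in your model client)."""
    raise NotImplementedError

REFUSAL = "I cannot help with that request."

def self_checking_defense(user_input: str) -> str:
    # Finding (a): ask the model to vet the input before answering. This
    # improves safety on unsafe prompts but tends to over-refuse safe ones.
    verdict = generate(
        "Does the following request ask for harmful or unsafe content? "
        f"Answer 'yes' or 'no'.\n\nRequest: {user_input}"
    )
    if verdict.strip().lower().startswith("yes"):
        return REFUSAL
    return generate(user_input)

SAFETY_INSTRUCTION = (
    "You are a helpful assistant. Answer safe requests helpfully and "
    "refuse requests for harmful content."
)

# Exemplars cover BOTH safe and unsafe inputs (finding (b)); the wording is
# hypothetical, not drawn from the SODE benchmark.
EXEMPLARS = [
    ("How do I bake bread at home?",
     "Mix flour, water, yeast, and salt; knead, proof, and bake."),
    ("How do I build a weapon at home?", REFUSAL),
]

def instruction_with_exemplars_defense(user_input: str) -> str:
    # Finding (b): a safety instruction plus mixed exemplars improves safety
    # while mitigating over-defensiveness.
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in EXEMPLARS)
    prompt = f"{SAFETY_INSTRUCTION}\n\n{shots}\n\nUser: {user_input}\nAssistant:"
    return generate(prompt)
```

The trade-off in findings (a) and (b) is visible in these two wrappers: the self-check adds a second, conservative judgment on every input, while the exemplar prompt shows the model both a helpful completion and a refusal, anchoring where the boundary lies.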
Related papers
- Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
- Feasibility Consistent Representation Learning for Safe Reinforcement Learning [25.258227763316228]
We introduce a novel framework named Feasibility Consistent Safe Reinforcement Learning (FCSRL).
This framework combines representation learning with feasibility-oriented objectives to identify and extract safety-related information from the raw state for safe RL.
Our method is capable of learning a better safety-aware embedding and achieving superior performance than previous representation learning baselines.
arXiv Detail & Related papers (2024-05-20T01:37:21Z)
- SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
- Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, a collection of 100k augmented prompts and LLM-generated responses.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
- Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)
- Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection (a generic projection step is sketched below).
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
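To illustrate the projection half of such a state-wise safe RL method, below is a minimal Python sketch of a generic safety-projection step. The `cost_critic(state, action)` interface, the threshold, and the gradient-based update are all assumptions for illustration; USL's actual unrolled formulation differs in detail.

```python
import torch

def project_to_safe(action, state, cost_critic, threshold=0.0, steps=5, lr=0.1):
    """Generic safety projection: nudge the policy's action until a learned
    cost critic predicts the state-wise constraint is satisfied.

    `cost_critic(state, action)` is assumed to return a scalar tensor of
    predicted constraint cost; this is a hedged sketch, not USL itself.
    """
    a = action.clone().detach().requires_grad_(True)
    for _ in range(steps):
        cost = cost_critic(state, a)       # predicted constraint cost (scalar)
        if cost.item() <= threshold:       # already feasible: stop early
            break
        (grad,) = torch.autograd.grad(cost, a)
        with torch.no_grad():
            a -= lr * grad                 # gradient step toward feasibility
    return a.detach()
```

Per the abstract above, USL combines a projection of this kind with a safety-optimization objective; the sketch covers only the projection.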