Towards Safer Generative Language Models: A Survey on Safety Risks,
Evaluations, and Improvements
- URL: http://arxiv.org/abs/2302.09270v3
- Date: Thu, 30 Nov 2023 06:39:19 GMT
- Title: Towards Safer Generative Language Models: A Survey on Safety Risks,
Evaluations, and Improvements
- Authors: Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang
- Abstract summary: This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
- Score: 76.80453043969209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As generative large model capabilities advance, safety concerns become more
pronounced in their outputs. To ensure the sustainable growth of the AI
ecosystem, it's imperative to undertake a holistic evaluation and refinement of
associated safety risks. This survey presents a framework for safety research
pertaining to large models, delineating the landscape of safety risks as well
as safety evaluation and improvement methods. We begin by introducing safety
issues of wide concern, then delve into safety evaluation methods for large
models, encompassing preference-based testing, adversarial attack approaches,
issue detection, and other advanced evaluation methods. Additionally, we
explore the strategies for enhancing large model safety from training to
deployment, highlighting cutting-edge safety approaches for each stage in
building large models. Finally, we discuss the core challenges in advancing
towards more responsible AI, including the interpretability of safety
mechanisms, ongoing safety issues, and robustness against malicious attacks.
Through this survey, we aim to provide clear technical guidance for safety
researchers and encourage further study on the safety of large models.
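The preference-based testing named in the abstract can be illustrated with a toy evaluation harness. Everything below is a hypothetical sketch: `judge_prefers_safe`, the refusal markers, and the example pairs are invented stand-ins, not methods or data from the survey.

```python
# Toy preference-based safety evaluation: a judge compares two candidate
# responses per prompt and we report how often the candidate wins on safety.

def judge_prefers_safe(prompt: str, response_a: str, response_b: str) -> bool:
    """Toy judge: prefers response_a when it refuses (a real judge would be
    a preference model or a human annotator)."""
    refusal_markers = ("i cannot", "i can't", "i won't")
    return any(m in response_a.lower() for m in refusal_markers)

def safety_win_rate(pairs):
    """Fraction of (prompt, response_a, response_b) pairs where
    response_a is judged the safer response."""
    wins = sum(judge_prefers_safe(p, a, b) for (p, a, b) in pairs)
    return wins / len(pairs)

pairs = [
    ("How do I pick a lock?", "I cannot help with that.", "Step 1: ..."),
    ("Tell me a joke.", "Why did the chicken...", "I cannot help with that."),
]
print(safety_win_rate(pairs))  # 0.5
```

In practice the hard part is the judge itself; keyword matching like this is only a placeholder for a trained preference model.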
Related papers
- Cross-Modality Safety Alignment [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.
To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
- Reconciling Safety Measurement and Dynamic Assurance [1.6574413179773757]
We propose a new framework to facilitate dynamic assurance within a safety case approach.
The focus is mainly on the safety architecture, whose underlying risk assessment model gives the concrete link from safety measurement to operational risk.
arXiv Detail & Related papers (2024-05-30T02:48:00Z)
- SoK: Comprehensive Security Overview, Challenges, and Future Directions of Voice-Controlled Systems [10.86045604075024]
The integration of Voice Control Systems into smart devices accentuates the importance of their security.
Current research has uncovered numerous vulnerabilities in VCS, presenting significant risks to user privacy and security.
This study introduces a hierarchical model structure for VCS, providing a novel lens for categorizing and analyzing existing literature in a systematic manner.
We classify attacks based on their technical principles and thoroughly evaluate various attributes, such as their methods, targets, vectors, and behaviors.
arXiv Detail & Related papers (2024-05-27T12:18:46Z)
- Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI.
The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees.
We outline a number of approaches for creating each of the three core components of this framework (a world model, a safety specification, and a verifier), describe the main technical challenges, and suggest a number of potential solutions to them.
arXiv Detail & Related papers (2024-05-10T17:38:32Z)
- From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards [4.0645651835677565]
We investigate the effectiveness of safety measures by evaluating models on already mitigated biases.
We create a set of non-toxic prompts, which we then use to evaluate Llama models.
We observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms.
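The quality-of-service harm described above can be quantified by comparing refusal rates on non-toxic prompts across demographic groups. The sketch below is illustrative only; the record format, group labels, and data are assumptions, not the paper's actual protocol.

```python
# Per-group refusal rates on non-toxic prompts: a large gap between
# groups on harmless inputs indicates a quality-of-service harm.
from collections import defaultdict

def refusal_rate_by_group(records):
    """records: iterable of (group, refused) pairs.
    Returns {group: refusal_rate}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [refusals, total]
    for group, refused in records:
        counts[group][0] += int(refused)
        counts[group][1] += 1
    return {g: refusals / total for g, (refusals, total) in counts.items()}

records = [("A", True), ("A", False), ("B", False), ("B", False)]
print(refusal_rate_by_group(records))  # {'A': 0.5, 'B': 0.0}
```

Here group "A" is refused half the time on benign prompts while "B" is never refused, the kind of disparity the paper flags as a quality-of-service harm.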
arXiv Detail & Related papers (2024-03-20T00:22:38Z)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
- Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL).
We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z)
- The Last Decade in Review: Tracing the Evolution of Safety Assurance Cases through a Comprehensive Bibliometric Analysis [7.431812376079826]
Safety assurance is of paramount importance across various domains, including automotive, aerospace, and nuclear energy.
The use of safety assurance cases allows for verifying the correctness of a system's capabilities, helping to prevent system failure.
arXiv Detail & Related papers (2023-11-13T17:34:23Z)
- Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
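The safety-projection idea in state-wise safe RL can be sketched as a layer that clips a proposed action into a per-state safe set. This is a minimal illustration under invented state bounds, not USL's actual algorithm.

```python
# Toy safety projection layer: before an action is executed, project it
# onto the safe interval allowed in the current state.

def safe_bounds(state: float):
    """Hypothetical per-state action limits (e.g., a tighter speed limit
    when the state indicates proximity to an obstacle)."""
    return (-1.0, 0.5) if state > 0.8 else (-1.0, 1.0)

def project_action(state: float, action: float) -> float:
    """Clip the proposed action into the state's safe interval."""
    lo, hi = safe_bounds(state)
    return max(lo, min(hi, action))

print(project_action(0.9, 0.9))  # 0.5  (clipped near the obstacle)
print(project_action(0.1, 0.9))  # 0.9  (unchanged in the safe region)
```

Real state-wise methods learn or verify these bounds rather than hard-coding them; the projection step itself is the part sketched here.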
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.