Towards Safer Generative Language Models: A Survey on Safety Risks,
Evaluations, and Improvements
- URL: http://arxiv.org/abs/2302.09270v3
- Date: Thu, 30 Nov 2023 06:39:19 GMT
- Title: Towards Safer Generative Language Models: A Survey on Safety Risks,
Evaluations, and Improvements
- Authors: Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang
- Abstract summary: This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
- Score: 76.80453043969209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As generative large model capabilities advance, safety concerns become more
pronounced in their outputs. To ensure the sustainable growth of the AI
ecosystem, it's imperative to undertake a holistic evaluation and refinement of
associated safety risks. This survey presents a framework for safety research
pertaining to large models, delineating the landscape of safety risks as well
as safety evaluation and improvement methods. We begin by introducing safety
issues of wide concern, then delve into safety evaluation methods for large
models, encompassing preference-based testing, adversarial attack approaches,
issues detection, and other advanced evaluation methods. Additionally, we
explore the strategies for enhancing large model safety from training to
deployment, highlighting cutting-edge safety approaches for each stage in
building large models. Finally, we discuss the core challenges in advancing
towards more responsible AI, including the interpretability of safety
mechanisms, ongoing safety issues, and robustness against malicious attacks.
Through this survey, we aim to provide clear technical guidance for safety
researchers and encourage further study on the safety of large models.
Related papers
- Defining and Evaluating Physical Safety for Large Language Models [62.4971588282174]
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones.
Their risks of causing physical threats and harm in real-world applications remain unexplored.
We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations.
arXiv Detail & Related papers (2024-11-04T17:41:25Z) - EAIRiskBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [47.69642609574771]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.
Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.
However, the deployment of these agents in physical environments presents significant safety challenges.
This study introduces EAIRiskBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context.
We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z) - Reconciling Safety Measurement and Dynamic Assurance [1.6574413179773757]
We propose a new framework to facilitate dynamic assurance within a safety case approach.
The focus is mainly on the safety architecture, whose underlying risk assessment model gives the concrete link from safety measurement to operational risk.
arXiv Detail & Related papers (2024-05-30T02:48:00Z) - Sok: Comprehensive Security Overview, Challenges, and Future Directions of Voice-Controlled Systems [10.86045604075024]
The integration of Voice Control Systems into smart devices accentuates the importance of their security.
Current research has uncovered numerous vulnerabilities in VCS, presenting significant risks to user privacy and security.
This study introduces a hierarchical model structure for VCS, providing a novel lens for categorizing and analyzing existing literature in a systematic manner.
We classify attacks based on their technical principles and thoroughly evaluate various attributes, such as their methods, targets, vectors, and behaviors.
arXiv Detail & Related papers (2024-05-27T12:18:46Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM
Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z) - Safeguarded Progress in Reinforcement Learning: Safe Bayesian
Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL)
We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.