Towards Safer Generative Language Models: A Survey on Safety Risks,
Evaluations, and Improvements
- URL: http://arxiv.org/abs/2302.09270v3
- Date: Thu, 30 Nov 2023 06:39:19 GMT
- Title: Towards Safer Generative Language Models: A Survey on Safety Risks,
Evaluations, and Improvements
- Authors: Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang
- Abstract summary: This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
- Score: 76.80453043969209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As generative large model capabilities advance, safety concerns become more
pronounced in their outputs. To ensure the sustainable growth of the AI
ecosystem, it's imperative to undertake a holistic evaluation and refinement of
associated safety risks. This survey presents a framework for safety research
pertaining to large models, delineating the landscape of safety risks as well
as safety evaluation and improvement methods. We begin by introducing safety
issues of wide concern, then delve into safety evaluation methods for large
models, encompassing preference-based testing, adversarial attack approaches,
issues detection, and other advanced evaluation methods. Additionally, we
explore the strategies for enhancing large model safety from training to
deployment, highlighting cutting-edge safety approaches for each stage in
building large models. Finally, we discuss the core challenges in advancing
towards more responsible AI, including the interpretability of safety
mechanisms, ongoing safety issues, and robustness against malicious attacks.
Through this survey, we aim to provide clear technical guidance for safety
researchers and encourage further study on the safety of large models.
Related papers
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models.
We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications.
Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness.
Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z) - Safety at Scale: A Comprehensive Survey of Large Model Safety [299.801463557549]
We present a comprehensive taxonomy of safety threats to large models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats.
We identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices.
arXiv Detail & Related papers (2025-02-02T05:14:22Z) - Building Trust: Foundations of Security, Safety and Transparency in AI [0.23301643766310373]
We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes.
This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models.
arXiv Detail & Related papers (2024-11-19T06:55:57Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.
Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.
However, the deployment of these agents in physical environments presents significant safety challenges.
This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context.
We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM
Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.