Safeguarding Large Language Models: A Survey
- URL: http://arxiv.org/abs/2406.02622v1
- Date: Mon, 3 Jun 2024 19:27:46 GMT
- Title: Safeguarding Large Language Models: A Survey
- Authors: Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang,
- Abstract summary: "Safeguards" or "guardrails" have become imperative to ensure the ethical use of Large Language Models (LLMs) within prescribed boundaries.
This article provides a systematic literature review on the current status of this critical mechanism.
It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts.
- Score: 20.854570045229917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review on the current status of this critical mechanism. It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by the techniques to evaluate, analyze, and enhance some (un)desirable properties that a guardrail might want to enforce, such as hallucinations, fairness, privacy, and so on. Based on them, we review techniques to circumvent these controls (i.e., attacks), to defend the attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that cannot be easily dealt with by the methods and present our vision on how to implement a comprehensive guardrail through the full consideration of multi-disciplinary approach, neural-symbolic method, and systems development lifecycle.
Related papers
- SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach [58.93030774141753]
Multimodal foundation models (MFMs) represent a significant advancement in artificial intelligence.
This paper conceptualizes cybersafety and cybersecurity in the context of multimodal learning.
We present a comprehensive Systematization of Knowledge (SoK) to unify these concepts in MFMs, identifying key threats.
arXiv Detail & Related papers (2024-11-17T23:06:20Z) - Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey [50.031628043029244]
Multimodal generative models are susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content.
This survey reviews jailbreak and defense in multimodal generative models.
arXiv Detail & Related papers (2024-11-14T07:51:51Z) - Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation.
Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks.
This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z) - Recent Advances in Attack and Defense Approaches of Large Language Models [27.271665614205034]
Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced text processing and generating capabilities.
Their widespread deployment has raised significant safety and reliability concerns.
This paper reviews current research on LLM vulnerabilities and threats, and evaluates the effectiveness of contemporary defense mechanisms.
arXiv Detail & Related papers (2024-09-05T06:31:37Z) - Current state of LLM Risks and AI Guardrails [0.0]
Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount.
These risks necessitate the development of "guardrails" to align LLMs with desired behaviors and mitigate potential harm.
This work explores the risks associated with deploying LLMs and evaluates current approaches to implementing guardrails and model alignment techniques.
arXiv Detail & Related papers (2024-06-16T22:04:10Z) - A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law [65.87885628115946]
Large language models (LLMs) are revolutionizing the landscapes of finance, healthcare, and law.
We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies.
We critically examine the ethics for LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems.
arXiv Detail & Related papers (2024-05-02T22:43:02Z) - Building Guardrails for Large Language Models [19.96292920696796]
Guardrails, which filter the inputs or outputs of LLMs, have emerged as a core safeguarding technology.
This position paper takes a deep look at current open-source solutions (Llama Guard, Nvidia NeMo, Guardrails AI) and discusses the challenges and the road towards building more complete solutions.
arXiv Detail & Related papers (2024-02-02T16:35:00Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM
Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z) - Survey of Vulnerabilities in Large Language Models Revealed by
Adversarial Attacks [5.860289498416911]
Large Language Models (LLMs) are swiftly advancing in architecture and capability.
As they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows.
This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs.
arXiv Detail & Related papers (2023-10-16T21:37:24Z) - Representation Engineering: A Top-Down Approach to AI Transparency [132.0398250233924]
We identify and characterize the emerging area of representation engineering (RepE)
RepE places population-level representations, rather than neurons or circuits, at the center of analysis.
We showcase how these methods can provide traction on a wide range of safety-relevant problems.
arXiv Detail & Related papers (2023-10-02T17:59:07Z) - Inspect, Understand, Overcome: A Survey of Practical Methods for AI
Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.