Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- URL: http://arxiv.org/abs/2312.06674v1
- Date: Thu, 7 Dec 2023 19:40:50 GMT
- Title: Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika
Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine,
Madian Khabsa
- Abstract summary: We introduce Llama Guard, an input-output safeguard model geared towards Human-AI conversation use cases.
Llama Guard incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks.
Llama Guard demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat.
- Score: 29.32704733570445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Llama Guard, an LLM-based input-output safeguard model geared
towards Human-AI conversation use cases. Our model incorporates a safety risk
taxonomy, a valuable tool for categorizing a specific set of safety risks found
in LLM prompts (i.e., prompt classification). This taxonomy is also
instrumental in classifying the responses generated by LLMs to these prompts, a
process we refer to as response classification. For the purpose of both prompt
and response classification, we have meticulously gathered a dataset of high
quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our
collected dataset, albeit low in volume, demonstrates strong performance on
existing benchmarks such as the OpenAI Moderation Evaluation dataset and
ToxicChat, where its performance matches or exceeds that of currently available
content moderation tools. Llama Guard functions as a language model, carrying
out multi-class classification and generating binary decision scores.
Furthermore, the instruction fine-tuning of Llama Guard allows for the
customization of tasks and the adaptation of output formats. This feature
enhances the model's capabilities, such as enabling the adjustment of taxonomy
categories to align with specific use cases, and facilitating zero-shot or
few-shot prompting with diverse taxonomies at the input. We are making Llama
Guard model weights available and we encourage researchers to further develop
and adapt them to meet the evolving needs of the community for AI safety.
Related papers
- Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification [29.74457390987092]
We propose a novel framework to identify and regularize unintended features in large language models (LLMs) latent spaces.
We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis.
arXiv Detail & Related papers (2025-02-19T22:27:59Z) - Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering [66.5524727179286]
NOVA is a framework designed to identify high-quality data that aligns well with the learned knowledge to reduce hallucinations.
It includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data.
To ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity.
arXiv Detail & Related papers (2025-02-11T08:05:56Z) - Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation [15.298017013140385]
We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG)
Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in decision-making, outperforms on classification and is more robust against adversarial attack.
Our findings also suggest that Class-RAG performance scales with retrieval library size, indicating that increasing the library size is a viable and low-cost approach to improve content moderation.
arXiv Detail & Related papers (2024-10-18T22:07:36Z) - Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels.
This study explores whether solely utilizing unlabeled data can elicit strong model capabilities.
We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z) - ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.