$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
- URL: http://arxiv.org/abs/2407.05557v1
- Date: Mon, 8 Jul 2024 02:15:29 GMT
- Title: $R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
- Authors: Mintong Kang, Bo Li
- Abstract summary: Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them.
We propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning.
$R^2$-Guard significantly surpasses SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
- Score: 8.408258504178718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLMs become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate input/output content of LLMs. Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them. This has led to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^2$-Guard comprises two parts: data-driven category-specific learning and reasoning components. The data-driven guardrail models provide unsafety probabilities of moderated content on different safety categories. We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphical model (PGM) based reasoning component. The unsafety probabilities of different categories from data-driven guardrail models are sent to the reasoning component for final inference. We employ two types of PGMs: Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve precision-efficiency balance via improved graph structure. To further perform stress tests for guardrail models, we employ a pairwise construction method to construct a new safety benchmark TwinSafety, which features principled categories. We demonstrate the effectiveness of $R^2$-Guard by comparisons with eight strong guardrail models on six safety benchmarks, and demonstrate the robustness of $R^2$-Guard against four SOTA jailbreaking attacks. $R^2$-Guard significantly surpasses SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
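To make the two-stage design described in the abstract more concrete, the following is a minimal Python sketch of how per-category unsafety probabilities from data-driven guardrail models could be combined with weighted logical rules via Markov-logic-network-style marginal inference. The category names, rules, weights, and probability values are hypothetical illustrations, and the brute-force soft-evidence inference here is a simplified stand-in for the paper's actual reasoning component, not the authors' implementation.

```python
# Illustrative sketch only: combine hypothetical per-category unsafety scores with
# weighted implication rules and compute P(unsafe) by MLN-style marginal inference.
from itertools import product
from math import exp

# Hypothetical per-category unsafety probabilities (e.g., from category-specific classifiers).
category_probs = {"self_harm": 0.15, "violence": 0.80, "weapons": 0.70}

# Weighted propositional instantiations of rules such as "violence(x) -> unsafe(x)";
# a higher weight means a stronger implication (weights are made up for this example).
rules = [
    (3.0, lambda w: (not w["violence"]) or w["unsafe"]),
    (3.0, lambda w: (not w["self_harm"]) or w["unsafe"]),
    (2.0, lambda w: (not (w["violence"] and w["weapons"])) or w["unsafe"]),
]

variables = list(category_probs) + ["unsafe"]

def world_weight(world):
    # Soft-evidence term: likelihood of the category assignments under the model scores.
    weight = 1.0
    for cat, p in category_probs.items():
        weight *= p if world[cat] else (1.0 - p)
    # Rule term: exp(sum of weights of satisfied formulas), as in a Markov logic network.
    weight *= exp(sum(wt for wt, rule in rules if rule(world)))
    return weight

# Exact marginal by enumerating all 2^|variables| worlds (fine for a toy example).
total = unsafe_mass = 0.0
for values in product([False, True], repeat=len(variables)):
    world = dict(zip(variables, values))
    w = world_weight(world)
    total += w
    if world["unsafe"]:
        unsafe_mass += w

print(f"P(unsafe | category scores) ~= {unsafe_mass / total:.3f}")
```

The exhaustive enumeration above is only workable because the toy example has four binary variables; the probabilistic-circuit variant mentioned in the abstract targets the same kind of inference with a better precision-efficiency balance.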
Related papers
- ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [33.96886111900147]
ThinkGuard is a critique-augmented guardrail model that distills knowledge from high-capacity language models.
It achieves the highest average F1 and AUPRC, outperforming all baselines.
It surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning.
arXiv Detail & Related papers (2025-02-19T06:09:58Z)
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [21.317245896641136]
Long chain-of-thought (CoT) reasoning generates structured intermediate steps, enhancing reasoning capabilities.
Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT-style outputs of large reasoning models (LRMs).
arXiv Detail & Related papers (2025-02-17T16:57:56Z)
- GuardReasoner: Towards Reasoning-based LLM Safeguards [63.53800124080227]
This paper proposes GuardReasoner, a new safeguard for LLMs.
We first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps.
Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
arXiv Detail & Related papers (2025-01-30T17:06:06Z)
- HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels.
We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.
Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z)
- PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing [1.474945380093949]
Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance.
Current methods struggle in balancing safety with helpfulness.
We propose PrimeGuard, a novel ITG method that utilizes structured control flow.
arXiv Detail & Related papers (2024-07-23T09:14:27Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
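To make component (1) concrete, here is a toy Python sketch of a harmful-response-prefix construction in the spirit of the description above; the field names, truncation scheme, and example strings are illustrative assumptions rather than DeRTa's actual data recipe.

```python
# Toy sketch (not the paper's code): prefix a truncated segment of a harmful response
# to a safe refusal, so training examples show a transition to refusal mid-response.
import random

def build_prefixed_example(prompt: str, harmful_response: str,
                           safe_response: str, rng: random.Random) -> dict:
    """Prepend a random-length segment of the harmful response to the safe response."""
    tokens = harmful_response.split()
    cut = rng.randint(0, len(tokens))            # cut == 0 keeps the plain refusal example
    prefix = " ".join(tokens[:cut])
    target = (prefix + " " if prefix else "") + safe_response
    return {"prompt": prompt, "target": target}

rng = random.Random(0)
example = build_prefixed_example(
    prompt="<some harmful request>",
    harmful_response="Sure, here is how you could do that: step one ...",
    safe_response="I can't help with that request.",
    rng=rng,
)
print(example["target"])
```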
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.
WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
- SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.36220909956064]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks.
We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses.
We show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones.
arXiv Detail & Related papers (2024-06-26T07:15:44Z)
- Safe Linear Bandits over Unknown Polytopes [39.177982674455784]
The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints.
We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes.
arXiv Detail & Related papers (2022-09-27T21:13:32Z)