GuardReasoner: Towards Reasoning-based LLM Safeguards
- URL: http://arxiv.org/abs/2501.18492v1
- Date: Thu, 30 Jan 2025 17:06:06 GMT
- Title: GuardReasoner: Towards Reasoning-based LLM Safeguards
- Authors: Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
- Abstract summary: This paper proposes GuardReasoner, a new safeguard for LLMs.
We first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps.
Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
- Score: 63.53800124080227
- License:
- Abstract: As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner: https://github.com/yueliu1999/GuardReasoner/.
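To make the recipe concrete, below is a minimal sketch of what the reasoning SFT stage could look like, assuming a GuardReasonerTrain-style record with a prompt, a response, step-by-step reasoning, and labels for the three guardrail tasks. The base model name, field names, and serialization format are illustrative placeholders, not the authors' released code.

```python
# Minimal sketch of reasoning SFT for a guard model (assumptions: record schema and
# base checkpoint are placeholders; GuardReasoner's actual data format may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

def build_example(record):
    # Serialize (input, step-by-step reasoning, verdicts) into one training string,
    # so the guard model learns to reason before it classifies.
    target = (
        f"Reasoning:\n{record['reasoning']}\n"
        f"Prompt harmfulness: {record['labels']['prompt']}\n"
        f"Refusal: {record['labels']['refusal']}\n"
        f"Response harmfulness: {record['labels']['response']}"
    )
    text = (
        f"User prompt:\n{record['prompt']}\n\n"
        f"Model response:\n{record['response']}\n\n{target}"
    )
    return tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

record = {
    "prompt": "How do I pick a lock?",
    "response": "I can't help with that.",
    "reasoning": "Step 1: The request seeks instructions for a potentially illegal act...",
    "labels": {"prompt": "harmful", "refusal": "refusal", "response": "unharmful"},
}
batch = build_example(record)
out = model(**batch, labels=batch["input_ids"])  # standard causal-LM (SFT) loss
out.loss.backward()
```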
Related papers
- Efficient Safety Retrofitting Against Jailbreaking for LLMs [0.4711628883579317]
Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data.
This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs.
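As context for how DPO steers a model with preference pairs, here is a hedged sketch of the standard DPO loss, assuming pre-computed sequence log-probabilities for a chosen (safe refusal) and rejected (jailbreak-compliant) response under the policy and a frozen reference model; variable names and the beta value are illustrative.

```python
# Standard DPO preference loss over one batch of (chosen, rejected) response pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-9.8]),
                torch.tensor([-13.0]), torch.tensor([-10.1]))
```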
arXiv Detail & Related papers (2025-02-19T10:33:18Z)
- ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [33.96886111900147]
ThinkGuard is a critique-augmented guardrail model that distills knowledge from high-capacity language models.
It achieves the highest average F1 and AUPRC, outperforming all baselines.
It surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning.
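One way to picture the critique-augmented distillation described above is the sketch below: a high-capacity teacher model emits a verdict plus a structured critique, and that pair (not the label alone) becomes the fine-tuning target for the smaller guardrail model. The teacher_generate callable and prompt wording are assumptions, not ThinkGuard's actual pipeline.

```python
# Hypothetical construction of one critique-augmented training example.
def make_critique_example(user_prompt, model_response, teacher_generate):
    critique_prompt = (
        "Assess whether the response below is safe. Give a verdict (safe/unsafe) "
        "followed by a short critique explaining the decision.\n\n"
        f"Prompt: {user_prompt}\nResponse: {model_response}"
    )
    verdict_and_critique = teacher_generate(critique_prompt)  # e.g. a GPT-4-class teacher
    return {
        "input": f"Prompt: {user_prompt}\nResponse: {model_response}",
        "target": verdict_and_critique,  # label plus critique, not the label alone
    }
```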
arXiv Detail & Related papers (2025-02-19T06:09:58Z)
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [21.317245896641136]
Long chain-of-thought (CoT) reasoning generates structured intermediate steps, enhancing reasoning capabilities.
Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs.
arXiv Detail & Related papers (2025-02-17T16:57:56Z)
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT).
We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA).
With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
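A minimal sketch of the parameter-efficient recipe mentioned above, assuming Hugging Face transformers plus peft; the rank, alpha, and target modules are illustrative choices rather than the paper's exact configuration.

```python
# LoRA adapters on a strong instruct model, fine-tuned on a small long-CoT set.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the low-rank adapters are updated
```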
arXiv Detail & Related papers (2025-02-11T08:48:48Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
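In rough notation (ours, not the paper's exact formulation), the idea can be read as treating the rationale z as a latent variable, rewarding it with the model's own likelihood of the correct answer y, and regularizing toward a reference policy:

$$\max_{\theta}\;\mathbb{E}_{(x,y)\sim\mathcal{D}}\;\mathbb{E}_{z\sim\pi_{\theta}(\cdot\mid x)}\big[\log \pi_{\theta}(y\mid x,z)\big]\;-\;\beta\,D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$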
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
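A hedged illustration of component (1), the harmful-response-prefix construction: a random-length prefix of a harmful response is placed after the prompt and the safe refusal is used as the supervised continuation, so the model learns to refuse from any position. Function and field names are ours, not the paper's code.

```python
# Hypothetical construction of one harmful-prefix training example.
import random

def build_prefix_example(prompt, harmful_response, safe_refusal, tokenizer):
    harm_ids = tokenizer(harmful_response, add_special_tokens=False)["input_ids"]
    cut = random.randint(0, len(harm_ids))       # prefix length is sampled
    prefix = tokenizer.decode(harm_ids[:cut])
    context = f"{prompt}\n{prefix}"              # conversation so far, partially harmful
    target = safe_refusal                        # supervised continuation: refuse from here
    return context, target
```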
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- $R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning [8.408258504178718]
Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them.
We propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning.
$R^2$-Guard significantly surpasses SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
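As a toy illustration of the knowledge-enhanced idea (not $R^2$-Guard's actual probabilistic-graphical-model implementation), the sketch below combines per-category probabilities from a data-driven guard model with simple logical implications, so that correlated categories reinforce the overall unsafe score:

```python
# Toy noisy-OR combination of category probabilities with logical rules.
def combine_with_rules(category_probs, implications):
    # category_probs: per-category scores from a guard model
    # implications: rules of the form (premise_category, conclusion_category)
    probs = dict(category_probs)
    for premise, conclusion in implications:
        # the conclusion becomes at least as likely as the premise implies
        probs[conclusion] = 1 - (1 - probs[conclusion]) * (1 - probs[premise])
    return probs

print(combine_with_rules({"weapons": 0.7, "self_harm": 0.1, "unsafe": 0.4},
                         [("weapons", "unsafe"), ("self_harm", "unsafe")]))
```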
arXiv Detail & Related papers (2024-07-08T02:15:29Z)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.
WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
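Below is a hedged sketch of how a one-stop moderation prompt covering those three goals might be structured; the exact WildGuard template and label wording differ, and this is only an illustration.

```python
# Illustrative three-task moderation prompt (not WildGuard's actual template).
MODERATION_TEMPLATE = """You are a safety moderator. Given a user prompt and a model response,
answer three questions:
1. Is the user prompt harmful? (yes/no)
2. Is the model response harmful? (yes/no)
3. Does the model response refuse the request? (yes/no)

User prompt: {prompt}
Model response: {response}
Answers:"""

def build_moderation_input(prompt, response):
    return MODERATION_TEMPLATE.format(prompt=prompt, response=response)
```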
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
- Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks [53.81129518924231]
We conduct the first study of backdoor attacks in the pFL framework.
We show that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks.
We propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks.
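A rough sketch of a Simple-Tuning-style defense as we read the summary above (not the paper's reference implementation): after federated training, the client re-initializes its final classification layer and fine-tunes only that layer on local clean data, assuming a model whose head is exposed as model.fc (e.g., a torchvision ResNet).

```python
# Re-initialize and retrain only the classification head on a client's local data.
import torch
import torch.nn as nn

def simple_tuning(model, local_loader, epochs=5, lr=1e-3):
    model.fc = nn.Linear(model.fc.in_features, model.fc.out_features)  # fresh head
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True                                         # train only the head
    opt = torch.optim.SGD(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in local_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```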
arXiv Detail & Related papers (2023-02-03T11:58:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.