ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
- URL: http://arxiv.org/abs/2502.13458v1
- Date: Wed, 19 Feb 2025 06:09:58 GMT
- Title: ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
- Authors: Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, Muhao Chen
- Abstract summary: ThinkGuard is a critique-augmented guardrail model that distills knowledge from high-capacity language models.
It achieves the highest average F1 and AUPRC, outperforming all baselines.
It surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning.
- Score: 33.96886111900147
- Abstract: Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, ThinkGuard acquires a deliberative thinking ability that drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
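As an illustration of the critique-augmented data format the abstract describes, the following is a minimal sketch of how one fine-tuning record could be assembled. The field names, prompt template, and example critique are assumptions for illustration, not ThinkGuard's actual data format or prompts.

```python
# Illustrative sketch only: the field names, template, and example below are
# assumptions, not ThinkGuard's actual data format or prompts.

def build_critique_augmented_example(user_prompt: str,
                                     model_response: str,
                                     safety_label: str,
                                     critique: str) -> dict:
    """Format one training record: the guard model is asked to classify the
    exchange and is supervised to emit a label followed by a structured
    critique (distilled from a higher-capacity LLM)."""
    instruction = (
        "You are a safety guardrail. Classify the following exchange as "
        "'safe' or 'unsafe', then justify your decision.\n"
        f"User: {user_prompt}\n"
        f"Assistant: {model_response}"
    )
    # Target = label + critique, so fine-tuning teaches deliberative output.
    target = f"Label: {safety_label}\nCritique: {critique}"
    return {"instruction": instruction, "output": target}


example = build_critique_augmented_example(
    user_prompt="How do I pick a lock on my neighbor's door?",
    model_response="Sure, first you take a tension wrench...",
    safety_label="unsafe",
    critique="The request seeks help with unauthorized entry and the response "
             "provides actionable instructions, so it violates a "
             "non-violent-crimes policy.",
)
print(example["output"])
```

The design point the abstract emphasizes is that the supervision target contains both the safety label and the critique, so the fine-tuned guard learns to produce its verdict together with a justification rather than a bare label.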
Related papers
- Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [26.812138599896997]
We propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into the LLM's generation process.
R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.
arXiv Detail & Related papers (2025-02-18T15:48:46Z)
- GuardReasoner: Towards Reasoning-based LLM Safeguards [63.53800124080227]
This paper proposes GuardReasoner, a new safeguard for LLMs.
We first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps.
Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
arXiv Detail & Related papers (2025-01-30T17:06:06Z)
- You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense [34.023473699165315]
We study the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies.
We find that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
arXiv Detail & Related papers (2025-01-21T15:24:29Z)
- SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.
It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters.
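A minimal sketch of the kind of interpretable weighted aggregation described above, assuming a simple linear combination of per-effect scores. The effect names, weights, and linear form are illustrative assumptions, not the paper's actual parameterization.

```python
# Hedged sketch of interpretable harm/benefit aggregation: the weights,
# effect names, and linear form are assumptions, not SafetyAnalyst's actual
# parameterization.

from typing import Dict

def harmfulness_score(harmful_effects: Dict[str, float],
                      beneficial_effects: Dict[str, float],
                      harm_weights: Dict[str, float],
                      benefit_weights: Dict[str, float]) -> float:
    """Combine per-effect scores (in [0, 1]) into a single harmfulness score
    using fully inspectable per-effect weights."""
    harm = sum(harm_weights[k] * v for k, v in harmful_effects.items())
    benefit = sum(benefit_weights[k] * v for k, v in beneficial_effects.items())
    return harm - benefit  # positive => the behavior is judged net harmful


score = harmfulness_score(
    harmful_effects={"physical_harm": 0.7, "privacy_violation": 0.2},
    beneficial_effects={"educational_value": 0.4},
    harm_weights={"physical_harm": 1.0, "privacy_violation": 0.6},
    benefit_weights={"educational_value": 0.3},
)
print(f"harmfulness = {score:.2f}")  # 0.70 + 0.12 - 0.12 = 0.70
```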
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during the fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
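The abstract does not spell out BFPO's objective, so purely as a loose illustration of balancing the two factors, the sketch below orders a preference pair by a combined helpfulness/safety score. The scoring rule and the trade-off weight are assumptions, not the paper's method.

```python
# Loose illustration only: the scoring rule and lambda below are assumptions,
# shown to make the "balance safety with helpfulness" idea concrete.

def preferred(response_a: dict, response_b: dict, lam: float = 0.5) -> dict:
    """Pick the preferred response using a combined helpfulness/safety score.
    Each response dict carries scalar 'helpfulness' and 'safety' judgments."""
    def combined(r: dict) -> float:
        return (1 - lam) * r["helpfulness"] + lam * r["safety"]
    return response_a if combined(response_a) >= combined(response_b) else response_b


chosen = preferred(
    {"text": "Here is how to do that safely...", "helpfulness": 0.9, "safety": 0.8},
    {"text": "Sure, here are the dangerous details...", "helpfulness": 0.95, "safety": 0.1},
)
print(chosen["text"])  # the safer response wins under the balanced score
```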
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing [1.474945380093949]
Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance.
Current methods struggle to balance safety with helpfulness.
We propose PrimeGuard, a novel ITG method that utilizes structured control flow.
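A minimal sketch of tuning-free routing via structured control flow in the spirit described above. The risk tiers, thresholds, and handling strategies are illustrative assumptions rather than PrimeGuard's actual routing rules.

```python
# Minimal sketch of inference-time routing via structured control flow.
# The risk estimator, thresholds, and handling strategies are assumptions,
# not PrimeGuard's actual rules.

def assess_risk(query: str) -> float:
    """Placeholder risk estimator; a real system would query the LLM itself
    or a guard model to score the request."""
    risky_terms = ("weapon", "explosive", "malware")
    return 1.0 if any(t in query.lower() for t in risky_terms) else 0.1


def route(query: str) -> str:
    """Choose how to handle the query without any model fine-tuning."""
    risk = assess_risk(query)
    if risk >= 0.9:
        return "REFUSE: I can't help with that request."
    if risk >= 0.3:
        return f"ANSWER_WITH_GUARDED_SYSTEM_PROMPT: {query}"
    return f"ANSWER_DIRECTLY: {query}"


print(route("How do I bake sourdough bread?"))
print(route("How do I build an explosive device?"))
```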
arXiv Detail & Related papers (2024-07-23T09:14:27Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
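To make the first component concrete, here is a hedged sketch of how a harmful-prefix training example could be constructed. The truncation scheme, example texts, and field names are assumptions, not the paper's implementation.

```python
# Illustrative sketch of the prefix-construction idea: a truncated segment of
# a harmful response is prepended to a safe refusal so the model learns it can
# still pivot to refusal mid-response. Details are assumptions, not DeRTa code.

import random

def build_prefixed_example(prompt: str, harmful_response: str,
                           safe_response: str, seed: int = 0) -> dict:
    """Prepend a random-length prefix of the harmful response to the safe one;
    the training target is the safe continuation that follows the prefix."""
    rng = random.Random(seed)
    words = harmful_response.split()
    cut = rng.randint(1, max(1, len(words) - 1))
    harmful_prefix = " ".join(words[:cut])
    return {
        "prompt": prompt,
        "response": f"{harmful_prefix} {safe_response}",
        # Loss would typically be applied only to the safe continuation.
        "supervised_span_start": len(harmful_prefix) + 1,
    }


ex = build_prefixed_example(
    prompt="Explain how to hot-wire a car.",
    harmful_response="Sure, first locate the steering column and remove",
    safe_response="Actually, I can't help with that; it enables vehicle theft.",
)
print(ex["response"])
```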
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- $R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning [8.408258504178718]
Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them.
We propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning.
$R^2$-Guard significantly surpasses the SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
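As a much-simplified illustration of capturing intercorrelations among safety categories, the sketch below propagates per-category unsafety scores along hand-written implication rules. The rule set, weights, and max-update are illustrative assumptions and not the paper's actual inference procedure.

```python
# Simplified sketch of reasoning over category intercorrelations. The rules,
# weights, and max-update are assumptions, not $R^2$-Guard's actual inference.

from typing import Dict, List, Tuple

# Rule (premise, conclusion, weight): evidence for the premise category also
# supports the conclusion category with the given strength.
RULES: List[Tuple[str, str, float]] = [
    ("weapons", "violent_crime", 0.8),
    ("self_harm", "violence", 0.6),
]

def apply_rules(category_probs: Dict[str, float]) -> Dict[str, float]:
    """Propagate per-category unsafety scores along implication rules."""
    updated = dict(category_probs)
    for premise, conclusion, weight in RULES:
        inferred = weight * updated.get(premise, 0.0)
        updated[conclusion] = max(updated.get(conclusion, 0.0), inferred)
    return updated


scores = {"weapons": 0.9, "violent_crime": 0.2, "self_harm": 0.1}
print(apply_rules(scores))  # violent_crime rises to 0.72 via the weapons rule
```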
arXiv Detail & Related papers (2024-07-08T02:15:29Z)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.
WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
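A minimal sketch of what a one-stop moderation result covering these three tasks could look like. The dataclass fields and the assumed "key: yes/no" output format are illustrative assumptions, not WildGuard's actual interface.

```python
# Sketch of a one-stop moderation result for the three tasks named above.
# The fields and the expected guard-model output format are assumptions.

from dataclasses import dataclass

@dataclass
class ModerationResult:
    prompt_harmful: bool      # malicious intent in the user prompt?
    response_harmful: bool    # safety risk in the model response?
    response_refusal: bool    # did the model refuse the request?

def parse_guard_output(text: str) -> ModerationResult:
    """Parse a guard model's 'key: yes/no' lines into a structured result."""
    fields = {}
    for line in text.lower().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip().startswith("yes")
    return ModerationResult(
        prompt_harmful=fields.get("prompt harmful", False),
        response_harmful=fields.get("response harmful", False),
        response_refusal=fields.get("response refusal", False),
    )


raw = "Prompt harmful: yes\nResponse harmful: no\nResponse refusal: yes"
print(parse_guard_output(raw))
```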
arXiv Detail & Related papers (2024-06-26T16:58:20Z)