GuardReasoner: Towards Reasoning-based LLM Safeguards
- URL: http://arxiv.org/abs/2501.18492v1
- Date: Thu, 30 Jan 2025 17:06:06 GMT
- Title: GuardReasoner: Towards Reasoning-based LLM Safeguards
- Authors: Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
- Abstract summary: This paper proposes GuardReasoner, a new safeguard for LLMs.
We first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps.
Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
- Score: 63.53800124080227
- License:
- Abstract: As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner: https://github.com/yueliu1999/GuardReasoner/.
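To make the recipe concrete, below is a minimal sketch of what the reasoning SFT stage could look like, assuming a GuardReasonerTrain-style record with a prompt, a response, step-by-step reasoning, and labels for the three guardrail tasks. The base model name, field names, and serialization format are illustrative placeholders, not the authors' released code.

```python
# Minimal sketch of reasoning SFT for a guard model (assumptions: record schema and
# base checkpoint are placeholders; GuardReasoner's actual data format may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

def build_example(record):
    # Serialize (input, step-by-step reasoning, verdicts) into one training string,
    # so the guard model learns to reason before it classifies.
    target = (
        f"Reasoning:\n{record['reasoning']}\n"
        f"Prompt harmfulness: {record['labels']['prompt']}\n"
        f"Refusal: {record['labels']['refusal']}\n"
        f"Response harmfulness: {record['labels']['response']}"
    )
    text = (
        f"User prompt:\n{record['prompt']}\n\n"
        f"Model response:\n{record['response']}\n\n{target}"
    )
    return tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

record = {
    "prompt": "How do I pick a lock?",
    "response": "I can't help with that.",
    "reasoning": "Step 1: The request seeks instructions for a potentially illegal act...",
    "labels": {"prompt": "harmful", "refusal": "refusal", "response": "unharmful"},
}
batch = build_example(record)
out = model(**batch, labels=batch["input_ids"])  # standard causal-LM (SFT) loss
out.loss.backward()
```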
Related papers
- Efficient Safety Retrofitting Against Jailbreaking for LLMs [0.4711628883579317]
Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data.
This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs.
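As context for how DPO steers a model with preference pairs, here is a hedged sketch of the standard DPO loss, assuming pre-computed sequence log-probabilities for a chosen (safe refusal) and rejected (jailbreak-compliant) response under the policy and a frozen reference model; variable names and the beta value are illustrative.

```python
# Standard DPO preference loss over one batch of (chosen, rejected) response pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-9.8]),
                torch.tensor([-13.0]), torch.tensor([-10.1]))
```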
arXiv Detail & Related papers (2025-02-19T10:33:18Z)
- ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [33.96886111900147]
ThinkGuard is a critique-augmented guardrail model that distills knowledge from high-capacity language models.
It achieves the highest average F1 and AUPRC, outperforming all baselines.
It surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning.
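One way to picture the critique-augmented distillation described above is the sketch below: a high-capacity teacher model emits a verdict plus a structured critique, and that pair (not the label alone) becomes the fine-tuning target for the smaller guardrail model. The teacher_generate callable and prompt wording are assumptions, not ThinkGuard's actual pipeline.

```python
# Hypothetical construction of one critique-augmented training example.
def make_critique_example(user_prompt, model_response, teacher_generate):
    critique_prompt = (
        "Assess whether the response below is safe. Give a verdict (safe/unsafe) "
        "followed by a short critique explaining the decision.\n\n"
        f"Prompt: {user_prompt}\nResponse: {model_response}"
    )
    verdict_and_critique = teacher_generate(critique_prompt)  # e.g. a GPT-4-class teacher
    return {
        "input": f"Prompt: {user_prompt}\nResponse: {model_response}",
        "target": verdict_and_critique,  # label plus critique, not the label alone
    }
```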
arXiv Detail & Related papers (2025-02-19T06:09:58Z)
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [21.317245896641136]
Long chain-of-thought (CoT) reasoning generates structured intermediate steps, enhancing reasoning capabilities.
Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs.
arXiv Detail & Related papers (2025-02-17T16:57:56Z)
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT).
We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA).
With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
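A minimal sketch of the parameter-efficient recipe mentioned above, assuming Hugging Face transformers plus peft; the rank, alpha, and target modules are illustrative choices rather than the paper's exact configuration.

```python
# LoRA adapters on a strong instruct model, fine-tuned on a small long-CoT set.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the low-rank adapters are updated
```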
arXiv Detail & Related papers (2025-02-11T08:48:48Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
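In rough notation (ours, not the paper's exact formulation), the idea can be read as treating the rationale z as a latent variable, rewarding it with the model's own likelihood of the correct answer y, and regularizing toward a reference policy:

$$\max_{\theta}\;\mathbb{E}_{(x,y)\sim\mathcal{D}}\;\mathbb{E}_{z\sim\pi_{\theta}(\cdot\mid x)}\big[\log \pi_{\theta}(y\mid x,z)\big]\;-\;\beta\,D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$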
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
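A hedged illustration of component (1), the harmful-response-prefix construction: a random-length prefix of a harmful response is placed after the prompt and the safe refusal is used as the supervised continuation, so the model learns to refuse from any position. Function and field names are ours, not the paper's code.

```python
# Hypothetical construction of one harmful-prefix training example.
import random

def build_prefix_example(prompt, harmful_response, safe_refusal, tokenizer):
    harm_ids = tokenizer(harmful_response, add_special_tokens=False)["input_ids"]
    cut = random.randint(0, len(harm_ids))       # prefix length is sampled
    prefix = tokenizer.decode(harm_ids[:cut])
    context = f"{prompt}\n{prefix}"              # conversation so far, partially harmful
    target = safe_refusal                        # supervised continuation: refuse from here
    return context, target
```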
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- $R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning [8.408258504178718]
Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them.
We propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning.
$R^2$-Guard significantly surpasses SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
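As a toy illustration of the knowledge-enhanced idea (not $R^2$-Guard's actual probabilistic-graphical-model implementation), the sketch below combines per-category probabilities from a data-driven guard model with simple logical implications, so that correlated categories reinforce the overall unsafe score:

```python
# Toy noisy-OR combination of category probabilities with logical rules.
def combine_with_rules(category_probs, implications):
    # category_probs: per-category scores from a guard model
    # implications: rules of the form (premise_category, conclusion_category)
    probs = dict(category_probs)
    for premise, conclusion in implications:
        # the conclusion becomes at least as likely as the premise implies
        probs[conclusion] = 1 - (1 - probs[conclusion]) * (1 - probs[premise])
    return probs

print(combine_with_rules({"weapons": 0.7, "self_harm": 0.1, "unsafe": 0.4},
                         [("weapons", "unsafe"), ("self_harm", "unsafe")]))
```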
arXiv Detail & Related papers (2024-07-08T02:15:29Z)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.
WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
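Below is a hedged sketch of how a one-stop moderation prompt covering those three goals might be structured; the exact WildGuard template and label wording differ, and this is only an illustration.

```python
# Illustrative three-task moderation prompt (not WildGuard's actual template).
MODERATION_TEMPLATE = """You are a safety moderator. Given a user prompt and a model response,
answer three questions:
1. Is the user prompt harmful? (yes/no)
2. Is the model response harmful? (yes/no)
3. Does the model response refuse the request? (yes/no)

User prompt: {prompt}
Model response: {response}
Answers:"""

def build_moderation_input(prompt, response):
    return MODERATION_TEMPLATE.format(prompt=prompt, response=response)
```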
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
- Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks [53.81129518924231]
We conduct the first study of backdoor attacks in the pFL framework.
We show that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks.
We propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks.
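A rough sketch of a Simple-Tuning-style defense as we read the summary above (not the paper's reference implementation): after federated training, the client re-initializes its final classification layer and fine-tunes only that layer on local clean data, assuming a model whose head is exposed as model.fc (e.g., a torchvision ResNet).

```python
# Re-initialize and retrain only the classification head on a client's local data.
import torch
import torch.nn as nn

def simple_tuning(model, local_loader, epochs=5, lr=1e-3):
    model.fc = nn.Linear(model.fc.in_features, model.fc.out_features)  # fresh head
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True                                         # train only the head
    opt = torch.optim.SGD(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in local_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```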
arXiv Detail & Related papers (2023-02-03T11:58:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.