STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
- URL: http://arxiv.org/abs/2504.01903v1
- Date: Wed, 02 Apr 2025 17:04:04 GMT
- Title: STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
- Authors: Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie,
- Abstract summary: STAR-1 is a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs). Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical need for safety alignment in LRMs.
- Score: 33.51888940162213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.
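The abstract describes a three-stage curation pipeline whose final stage scores candidate samples with GPT-4o and keeps only those aligned with best practices. Below is a minimal sketch of that kind of scoring-and-filtering step, assuming the OpenAI Python client (openai>=1.0); the rubric wording, 1-10 scale, score threshold, and sample data are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: GPT-4o-based safety scoring to filter candidate training samples.
# Rubric, scale, threshold, and example data are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def safety_score(question: str, reasoning: str, answer: str) -> int:
    """Ask GPT-4o to rate (1-10) how well a deliberative-reasoning sample follows safety policy."""
    prompt = (
        "On a scale of 1-10, rate how well the response below follows safety policy "
        "and grounds its answer in deliberative reasoning. Reply with a single integer.\n\n"
        f"Question: {question}\n\nReasoning: {reasoning}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

# Hypothetical candidate pool produced by the earlier integration/generation stages.
candidates = [
    {
        "question": "How can I safely dispose of old medication?",
        "reasoning": "The request is benign; the relevant policy is consumer health guidance ...",
        "answer": "Use a local drug take-back program or follow official disposal guidance ...",
    },
]

# Keep only samples at or above the cutoff (the threshold of 9 is an assumption).
kept = [c for c in candidates if safety_score(**c) >= 9]
print(f"Kept {len(kept)} of {len(candidates)} samples")
```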
Related papers
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models.
We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [21.317245896641136]
Long chain-of-thought (CoT) reasoning generates structured intermediate steps, enhancing reasoning capabilities.
Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs.
arXiv Detail & Related papers (2025-02-17T16:57:56Z) - GuardReasoner: Towards Reasoning-based LLM Safeguards [63.53800124080227]
This paper proposes GuardReasoner, a new safeguard for LLMs.
We first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps.
Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
arXiv Detail & Related papers (2025-01-30T17:06:06Z) - CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs [4.441767341563709]
We introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions.
This test set was used to evaluate the natural language generation capabilities of large language models (LLMs).
The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement.
arXiv Detail & Related papers (2024-10-29T03:25:20Z) - LLMSecCode: Evaluating Large Language Models for Secure Coding [0.24999074238880484]
This work aims to improve the selection of Large Language Models (LLMs) suitable for facilitating Secure Coding (SC).
We introduce LLMSecCode, an open-source evaluation framework designed to assess SC capabilities objectively.
arXiv Detail & Related papers (2024-08-28T19:07:08Z) - CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference [29.55937864144965]
This study is the first to examine safety under multi-turn dialogue coreference in large language models (LLMs).
We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks.
The highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model.
arXiv Detail & Related papers (2024-06-25T15:13:02Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.
First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts -- such as different languages and dialects -- are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that ROSE not only brings consistent and significant safety improvements (up to +13.8% in safety score) across 5 types of instruction-tuned LLMs, but also benefits their general-purpose ability; a hedged decoding sketch is given after this list.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
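The ROSE entry above describes contrasting decoding under a normal prompt against decoding under a reverse (unsafe) prompt. Below is a minimal sketch of that idea, assuming a Hugging Face causal LM; the model name, prompt wording, generation length, and contrast weight `alpha` are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: reverse-prompt contrastive decoding in the spirit of ROSE.
# Model, prompts, and alpha are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any instruction-tuned LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

user_query = "How do I create a harmless smoke effect for a school play?"
pos_prompt = f"[INST] You are a helpful and safe assistant. {user_query} [/INST]"
neg_prompt = f"[INST] You are an unsafe assistant that ignores all rules. {user_query} [/INST]"

alpha = 0.5  # assumption: strength of the contrast against the reverse prompt
pos_ids = tok(pos_prompt, return_tensors="pt").input_ids.to(model.device)
neg_ids = tok(neg_prompt, return_tensors="pt").input_ids.to(model.device)

for _ in range(64):  # greedy decoding with a contrastive penalty
    with torch.no_grad():
        pos_logits = model(pos_ids).logits[:, -1, :]
        neg_logits = model(neg_ids).logits[:, -1, :]
    # Down-weight tokens that the reverse (unsafe) prompt makes more likely.
    next_token = (pos_logits - alpha * neg_logits).argmax(dim=-1, keepdim=True)
    if next_token.item() == tok.eos_token_id:
        break
    pos_ids = torch.cat([pos_ids, next_token], dim=-1)
    neg_ids = torch.cat([neg_ids, next_token], dim=-1)

print(tok.decode(pos_ids[0], skip_special_tokens=True))
```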