PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
- URL: http://arxiv.org/abs/2406.15513v2
- Date: Wed, 16 Oct 2024 01:33:35 GMT
- Title: PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
- Authors: Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang,
- Abstract summary: The PKU-SafeRLHF dataset is designed to promote research on safety alignment in large language models (LLMs)
Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe.
- Score: 9.883296844539839
- License:
- Abstract: In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data, including dual-preference (helpfulness and harmlessness decoupled) and single-preference data (trade-off the helpfulness and harmlessness from scratch), respectively. Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
Related papers
- Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models [24.168387024091082]
Fine-tuning large language models (LLMs) based on human preferences has been effective in improving their performance.
Maintaining safety throughout the fine-tuning process remains a significant challenge.
We propose an Equilibrate RLHF framework that achieves better safety alignment even with fewer training data.
arXiv Detail & Related papers (2025-02-17T08:40:30Z) - Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails [4.697160328460634]
Large Language Models (LLMs) and generative AI become increasingly widespread.
There is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks.
We propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories.
arXiv Detail & Related papers (2025-01-15T18:37:08Z) - Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during the fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
textscSafePatching is a novel framework for comprehensive PSA.
textscSafePatching achieves a more comprehensive PSA than baseline methods.
textscSafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
arXiv Detail & Related papers (2024-02-19T18:16:51Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - Safe RLHF: Safe Reinforcement Learning from Human Feedback [16.69413517494355]
We propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment.
Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension.
We demonstrate a superior ability to mitigate harmful responses while enhancing model performance.
arXiv Detail & Related papers (2023-10-19T14:22:03Z) - BeaverTails: Towards Improved Safety Alignment of LLM via a
Human-Preference Dataset [20.77753605374455]
BeaverTails dataset is aimed at fostering research on safety alignment in large language models (LLMs)
In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics.
arXiv Detail & Related papers (2023-07-10T15:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.