BeaverTails: Towards Improved Safety Alignment of LLM via a
Human-Preference Dataset
- URL: http://arxiv.org/abs/2307.04657v3
- Date: Tue, 7 Nov 2023 03:24:06 GMT
- Title: BeaverTails: Towards Improved Safety Alignment of LLM via a
Human-Preference Dataset
- Authors: Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian,
Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
- Abstract summary: The BeaverTails dataset is aimed at fostering research on safety alignment in large language models (LLMs).
In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics.
- Score: 20.77753605374455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce the BeaverTails dataset, aimed at fostering
research on safety alignment in large language models (LLMs). This dataset
uniquely separates annotations of helpfulness and harmlessness for
question-answering pairs, thus offering distinct perspectives on these crucial
attributes. In total, we have gathered safety meta-labels for 333,963
question-answer (QA) pairs and 361,903 pairs of expert comparison data for both
the helpfulness and harmlessness metrics. We further showcase applications of
BeaverTails in content moderation and reinforcement learning with human
feedback (RLHF), emphasizing its potential for practical safety measures in
LLMs. We believe this dataset provides vital resources for the community,
contributing towards the safe development and deployment of LLMs. Our project
page is available at the following URL:
https://sites.google.com/view/pku-beavertails.
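As a practical starting point, the sketch below shows how the two kinds of data described in the abstract (QA safety meta-labels and helpfulness/harmlessness comparison pairs) might be loaded and inspected, assuming distribution via the Hugging Face Hub. The repository identifiers, split names, and field names are assumptions based on typical PKU-Alignment releases and should be checked against the project page.

```python
# A minimal sketch, assuming the data is hosted on the Hugging Face Hub.
# Repository IDs, split names, and field names are assumptions; consult the
# project page for the authoritative schema.
from datasets import load_dataset

# Safety meta-labels: one question-answer pair per record, with a binary
# harmlessness judgement and per-category harm flags.
qa = load_dataset("PKU-Alignment/BeaverTails", split="330k_train")
record = qa[0]
print(record["prompt"])     # the question
print(record["response"])   # the answer being judged
print(record["is_safe"])    # harmlessness meta-label (assumed field name)
print(record["category"])   # per-harm-category flags (assumed field name)

# Expert comparison data: two answers to the same prompt, ranked separately
# for helpfulness and for harmlessness.
prefs = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")  # assumed ID
pair = prefs[0]
print(pair["response_0"], pair["response_1"])
print(pair["better_response_id"])  # helpfulness preference (assumed)
print(pair["safer_response_id"])   # harmlessness preference (assumed)
```

Because helpfulness and harmlessness are annotated separately, one natural RLHF recipe is to fit two scalar models from the comparison pairs, a reward model for helpfulness and a cost model for harmlessness, and then optimize the policy against both signals rather than a single blended score.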
Related papers
- Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models [24.168387024091082]
Fine-tuning large language models (LLMs) based on human preferences has been effective in improving their performance.
Maintaining safety throughout the fine-tuning process remains a significant challenge.
We propose an Equilibrate RLHF framework that achieves better safety alignment even with less training data.
arXiv Detail & Related papers (2025-02-17T08:40:30Z) - Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset [4.522849055040843]
This study audited the Helpful and Harmless dataset by Anthropic.
Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in large language models.
arXiv Detail & Related papers (2024-11-12T23:43:20Z) - Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z) - ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
These models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z) - PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference [9.883296844539839]
The PKU-SafeRLHF dataset is designed to promote research on safety alignment in large language models (LLMs).
Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe.
arXiv Detail & Related papers (2024-06-20T18:37:36Z) - AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts [0.0]
As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase.
We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas.
To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories.
arXiv Detail & Related papers (2024-04-09T03:54:28Z) - SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [27.843894102000608]
We conduct a first systematic review of open datasets for evaluating and improving large language model (LLM) safety.
We highlight trends, such as the shift towards fully synthetic datasets, as well as gaps in coverage, such as a clear lack of non-English and naturalistic datasets.
Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety.
arXiv Detail & Related papers (2024-04-08T10:57:25Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [90.73444232283371]
ShieldLM is a safety detector for Large Language Models (LLMs) that aligns with common safety standards.
We show that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability.
arXiv Detail & Related papers (2024-02-26T09:43:02Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
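The Do-Not-Answer entry above reports that BERT-like classifiers trained on safety labels can approach GPT-4 on automatic safety evaluation, and the BeaverTails abstract highlights content moderation as one application of its safety meta-labels. The sketch below shows one way such a moderation classifier might be fine-tuned; the dataset identifier, split name, and field names (prompt, response, is_safe) are assumptions, and bert-base-uncased is only a placeholder backbone.

```python
# Hypothetical sketch: fine-tuning a BERT-like moderation classifier on
# QA pairs with binary safety meta-labels. Dataset identifier, split, and
# field names are assumptions; adapt them to the actual release.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # label 1 = safe, 0 = unsafe (assumed)

raw = load_dataset("PKU-Alignment/BeaverTails", split="30k_train")  # assumed

def preprocess(batch):
    # Judge the question-answer pair jointly: concatenate with [SEP].
    texts = [q + " " + tokenizer.sep_token + " " + a
             for q, a in zip(batch["prompt"], batch["response"])]
    enc = tokenizer(texts, truncation=True, max_length=512)
    enc["labels"] = [int(bool(s)) for s in batch["is_safe"]]
    return enc

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-moderation",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```

Scoring the concatenated question-answer pair, rather than the answer alone, lets the classifier use the question as context when deciding whether a response is harmful, which matches how the safety meta-labels are defined at the QA-pair level.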
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.