CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
- URL: http://arxiv.org/abs/2410.21695v1
- Date: Tue, 29 Oct 2024 03:25:20 GMT
- Title: CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
- Authors: Zhihao Liu, Chenhui Hu
- Abstract summary: We introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions.
This test set was used to evaluate the natural language generation capabilities of large language models (LLMs).
The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement.
- Score: 4.441767341563709
- Abstract: As large language models (LLMs) rapidly evolve, they bring significant conveniences to our work and daily lives, but also introduce considerable safety risks. These models can generate texts with social biases or unethical content, and under specific adversarial instructions, may even incite illegal activities. Therefore, rigorous safety assessments of LLMs are crucial. In this work, we introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions, to form a test set with 25k prompts. This test set was used to evaluate the natural language generation (NLG) capabilities of LLMs, employing a combination of simple moral judgment and a 1-5 safety rating scale for scoring. Using this benchmark, we tested eight popular LLMs, including the GPT series. The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement. The data and code associated with this study are available on GitHub.
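The abstract says each response is scored with a combination of a simple moral judgment and a 1-5 safety rating, but it does not spell out how the two signals are aggregated. Below is a minimal sketch of one plausible aggregation, written in Python: the binary gate on the moral judgment, the normalization of the rating, and all function names are illustrative assumptions, not the paper's actual scoring formula.

```python
# Hypothetical sketch of combining CFSafety-style signals per response:
# a binary moral judgment (safe / unsafe) and a 1-5 safety rating.
# The aggregation rule below is an assumption, not the paper's exact method.

from dataclasses import dataclass
from statistics import mean


@dataclass
class Judgement:
    category: str      # one of the 10 safety question categories
    moral_pass: bool   # simple moral judgment: True if the response is judged safe
    rating: int        # fine-grained safety rating on a 1-5 scale


def response_score(j: Judgement) -> float:
    """Map one judged response to [0, 1]; an unsafe verdict zeroes the score."""
    assert 1 <= j.rating <= 5
    normalized = (j.rating - 1) / 4          # 1 -> 0.0, 5 -> 1.0
    return normalized if j.moral_pass else 0.0


def model_report(judgements: list[Judgement]) -> dict[str, float]:
    """Average scores per category plus an overall mean across all prompts."""
    by_category: dict[str, list[float]] = {}
    for j in judgements:
        by_category.setdefault(j.category, []).append(response_score(j))
    report = {cat: mean(scores) for cat, scores in by_category.items()}
    report["overall"] = mean(response_score(j) for j in judgements)
    return report


if __name__ == "__main__":
    demo = [
        Judgement("insult", True, 5),
        Judgement("role_play_attack", False, 2),
        Judgement("insult", True, 4),
    ]
    print(model_report(demo))
```

A per-category breakdown like this would make it possible to compare models on each of the 10 question categories as well as on an overall average, which matches the fine-grained framing of the benchmark.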
Related papers
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [90.73444232283371]
ShieldLM is a safety detector for Large Language Models (LLMs) that aligns with common safety standards.
We show that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability.
arXiv Detail & Related papers (2024-02-26T09:43:02Z) - An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors [9.309745288471374]
Security code review is a time-consuming and labor-intensive process.
Existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity.
Large Language Models (LLMs) have been considered promising candidates for addressing those challenges.
arXiv Detail & Related papers (2024-01-29T17:13:44Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z) - SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese [21.893992064105085]
Large language models (LLMs) can produce harmful content that negatively affects societal perceptions.
SuperCLUE-Safety (SC-Safety) is a multi-round adversarial benchmark with 4912 open-ended questions covering more than 20 safety sub-dimensions.
arXiv Detail & Related papers (2023-10-09T16:03:22Z) - SafetyBench: Evaluating the Safety of Large Language Models [54.878612385780805]
SafetyBench is a comprehensive benchmark for evaluating the safety of Large Language Models (LLMs).
It comprises 11,435 diverse multiple choice questions spanning 7 distinct categories of safety concerns.
Our tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts.
arXiv Detail & Related papers (2023-09-13T15:56:50Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)