Related papers: ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models

URL: http://arxiv.org/abs/2507.00026v1
Date: Tue, 17 Jun 2025 10:55:17 GMT
Title: ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models
Authors: Jiale Ding, Xiang Zheng, Cong Wang, Wei-Bin Lee, Xingjun Ma, Yu-Gang Jiang
Abstract summary: Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications. We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
Score: 60.28667314609623
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications, evaluating their safety, especially under adversarial prompting, has become critical. Arguably, effective safety evaluations should be adaptive, evolving with LLM capabilities, and also cover a broad spectrum of harmful topics and real-world scenarios to fully expose potential vulnerabilities. Existing manual safety benchmarks, built on handcrafted adversarial prompts, are limited by their static nature and the intensive labor required to update them, making it difficult to keep pace with rapidly advancing LLMs. In contrast, automated adversarial prompt generation offers a promising path toward adaptive evaluation. However, current methods often suffer from insufficient adversarial topic coverage (topic-level diversity) and weak alignment with real-world contexts. These shortcomings stem from the exploration-exploitation dilemma in black-box optimization and a lack of real-world contextualization, resulting in adversarial prompts that are both topically narrow and scenario-repetitive. To address these issues, we propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM for generating topically diverse and contextually rich adversarial prompts. Experiments show that ROSE outperforms existing methods in uncovering safety vulnerabilities in state-of-the-art LLMs, with notable improvements in integrated evaluation metrics. We hope ROSE represents a step toward more practical and reality-oriented safety evaluation of LLMs. WARNING: This paper contains examples of potentially harmful text.

Related papers

$\texttt{SAGE}$: A Generic Framework for LLM Safety Evaluation [9.935219917903858]
This paper introduces the $\texttt{SAGE}$ (Safety AI Generic Evaluation) framework. $\texttt{SAGE}$ is an automated modular framework designed for customized and dynamic harm evaluations. Our experiments with multi-turn conversational evaluations revealed a concerning finding that harm steadily increases with conversation length.
arXiv Detail & Related papers (2025-04-28T11:01:08Z)
Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories [14.605576275135522]
Evaluating value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts. We propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios.
arXiv Detail & Related papers (2025-03-28T03:31:37Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z)
garak: A Framework for Security Probing Large Language Models [16.305837349514505]
garak is a framework that can be used to discover and identify vulnerabilities in a target Large Language Model (LLM). The outputs of the framework describe a target model's weaknesses and contribute to an informed discussion of what composes vulnerabilities in unique contexts.
arXiv Detail & Related papers (2024-06-16T18:18:43Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs). It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes. To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.