Related papers: The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

URL: http://arxiv.org/abs/2502.20757v1
Date: Fri, 28 Feb 2025 06:18:50 GMT
Title: The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Authors: Yihong Tang, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Bo Wang, Jie Liu, Min Zhang,
Abstract summary: Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations.<n>It remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content.<n>We propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling.
Score: 29.974647411289826
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model's ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.

Related papers

AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models [23.916663925674737]
Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked.<n>We propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis.<n>Our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics.
arXiv Detail & Related papers (2025-05-29T03:02:18Z)
SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator [77.86600052899156]
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications.<n>We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation.<n>We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
arXiv Detail & Related papers (2025-05-23T10:56:06Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models [37.104276926258095]
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data. We introduce textbfDREAM (textittextbfDisentangling textbfRisks to textbfEnhance Safety textbfAlignment in textbfMLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback.
arXiv Detail & Related papers (2025-04-25T03:54:24Z)
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs [73.09848497762667]
We conduct the first comprehensive assessment of role-play fine-tuning risks using RoleBench.<n>Experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance.<n>We propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety.
arXiv Detail & Related papers (2025-02-28T11:31:27Z)
Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.<n>We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter.<n>We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios [32.16984263644299]
Large Language Models (LLMs) can generate valuable data for safety measures, but often exhibit distributional biases. We propose a novel framework that integrates active learning with clustering to guide LLM generation. Our results show that the proposed framework produces a more representative set of safety scenarios without requiring prior knowledge of the underlying data distribution.
arXiv Detail & Related papers (2024-10-14T21:48:14Z)
Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety. For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints [0.0]
We introduce an innovative framework integrating diffusion models within the Multi-agent Reinforcement Learning paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action.
arXiv Detail & Related papers (2024-06-30T16:05:31Z)
Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
This study explores whether Large Language Models can produce safe, unbiased outputs without sacrificing knowledge or comprehension.<n>We introduce the Safe and Responsible Large Language Model (textbfSR$_textLLM$)<n>Experiments on our specialized dataset and out-of-distribution test sets reveal that textbfSR$_textLLM$ effectively reduces biases while preserving knowledge integrity.
arXiv Detail & Related papers (2024-04-01T18:10:05Z)
Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL) We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z)
Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems. LLMs are intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning. We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
arXiv Detail & Related papers (2023-11-28T03:13:09Z)
Modeling and mitigation of occupational safety risks in dynamic industrial environments [0.0]
This article proposes a method to enable continuous and quantitative assessment of safety risks in a data-driven manner. A fully Bayesian approach is developed to calibrate this model from safety data in an online fashion. The proposed model can be leveraged for automated decision making.
arXiv Detail & Related papers (2022-05-02T13:04:25Z)
SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics. We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations. We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.