Related papers: Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

URL: http://arxiv.org/abs/2502.20968v1
Date: Fri, 28 Feb 2025 11:31:27 GMT
Title: Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Authors: Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu,
Abstract summary: We conduct the first comprehensive assessment of role-play fine-tuning risks using RoleBench.<n>Experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance.<n>We propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety.
Score: 73.09848497762667
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.

Related papers

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks.<n>LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z)
Improving LLM Reasoning through Interpretable Role-Playing Steering [23.75554062102392]
Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs)<n>We introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior.<n>Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model's residual stream with controllable intensity.
arXiv Detail & Related papers (2025-06-09T00:31:17Z)
The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents [29.974647411289826]
Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations.<n>It remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content.<n>We propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling.
arXiv Detail & Related papers (2025-02-28T06:18:50Z)
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [47.83354878065321]
We propose AGrail, a lifelong guardrail to enhance agent safety.<n>AGrail features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility.
arXiv Detail & Related papers (2025-02-17T05:12:33Z)
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.<n>Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts [0.0]
As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories.
arXiv Detail & Related papers (2024-04-09T03:54:28Z)
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training. Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [62.898963074989766]
We introduce Ditto, a self-alignment method for role-play. This method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold. We present the first comprehensive cross-supervision alignment experiment in the role-play domain.
arXiv Detail & Related papers (2024-01-23T03:56:22Z)
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [28.0550468465181]
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. This work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records.
arXiv Detail & Related papers (2024-01-18T14:40:46Z)
Evil Geniuses: Delving into the Safety of LLM-based Agents [35.49857256840015]
Large language models (LLMs) have revitalized in large language models (LLMs) This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level.
arXiv Detail & Related papers (2023-11-20T15:50:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.