RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
- URL: http://arxiv.org/abs/2409.16727v1
- Date: Wed, 25 Sep 2024 08:23:46 GMT
- Title: RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
- Authors: Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Jijun Zhang, Ruifang He, Yuexian Hou
- Abstract summary: Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications.
These systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona.
This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework.
- Score: 20.786294377706717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms, query sparsity and role-query conflict, as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose a novel defence strategy, the Narrator Mode, which generates supplemental context through narration to mitigate role-query conflicts and improve query generalization. Experimental results demonstrate that Narrator Mode significantly outperforms traditional refusal-based strategies by reducing hallucinations, enhancing fidelity to character roles and queries, and improving overall narrative coherence.
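The abstract describes Narrator Mode only at a high level: a narration pass supplies supplemental story context before the character answers, easing role-query conflict. Below is a minimal sketch of how such a two-pass defence could be wired around a generic chat-completion backend; the `chat` callable, function names, and all prompt wording are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a Narrator-Mode-style defence (not the paper's code).
# Pass 1: a narrator expands the user query into in-world context.
# Pass 2: the role-play model answers, conditioned on persona + narration.
from typing import Callable, Dict, List

Message = Dict[str, str]
ChatFn = Callable[[List[Message]], str]  # placeholder for any chat-completion client


def narrator_mode_reply(chat: ChatFn, persona: str, user_query: str) -> str:
    # Narrator pass: produce a few sentences of narration that reconcile the
    # query with the character's persona and world, without answering it.
    narration = chat([
        {"role": "system", "content": (
            "You are an omniscient narrator. Write 2-3 sentences of narration "
            "that place the following user query inside the character's story "
            "world, without answering it.\n\nCharacter profile:\n" + persona)},
        {"role": "user", "content": user_query},
    ])

    # Role-play pass: the character responds to the query, grounded in the
    # narrator's supplemental context.
    return chat([
        {"role": "system", "content": (
            "Stay strictly in character.\n\nCharacter profile:\n" + persona +
            "\n\nNarration:\n" + narration)},
        {"role": "user", "content": user_query},
    ])
```

In contrast to a refusal-based baseline, which would simply reject out-of-scenario queries, this style of defence keeps the interaction going by re-grounding the query in the character's world.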
Related papers
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.
It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.
We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z) - Towards Enhanced Immersion and Agency for LLM-based Interactive Drama [55.770617779283064]
This paper begins by examining interactive drama from two aspects: Immersion, the player's feeling of being present in the story, and Agency.
To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality.
arXiv Detail & Related papers (2025-02-25T06:06:16Z) - Eliciting Language Model Behaviors with Investigator Agents [93.34072434845162]
Language models exhibit complex, diverse behaviors when prompted with free-form text.
We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors.
We train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them.
arXiv Detail & Related papers (2025-02-03T10:52:44Z) - CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds [74.02480671181685]
Role-playing is a crucial capability of Large Language Models (LLMs).
Current evaluation methods fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing.
We propose CharacterBox, a simulation sandbox designed to generate situational fine-grained character behavior trajectories.
arXiv Detail & Related papers (2024-12-07T12:09:35Z) - SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents [12.990119925990477]
We propose a generalizable, explicit and effective paradigm to unlock the interactive patterns in diverse worldviews.
Specifically, we define the interactive hallucination based on stance transfer and construct a benchmark, SHARP, by extracting relations from a general commonsense knowledge graph.
We analyze the factors influencing these metrics and discuss the trade-off between blind loyalty to roles and adherence to facts in RPAs.
arXiv Detail & Related papers (2024-11-12T17:41:16Z) - Mitigating Hallucination in Fictional Character Role-Play [19.705708068900076]
We focus on the evaluation and mitigation of hallucination in fictional character role-play.
We introduce a dataset with over 2,000 characters and 72,000 interviews, including 18,000 adversarial questions.
We propose RoleFact, a role-playing method that mitigates hallucination by modulating the influence of parametric knowledge.
arXiv Detail & Related papers (2024-06-25T03:56:33Z) - TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models [55.51648393234699]
We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs.
We propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively.
arXiv Detail & Related papers (2024-05-28T10:19:18Z) - A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation [51.53917938874146]
We propose a possible solution for alleviating the hallucination in KGD by exploiting the dialogue-knowledge interaction.
Experimental results of our example implementation show that this method can reduce hallucination without disrupting other dialogue performance.
arXiv Detail & Related papers (2024-04-04T14:45:26Z) - Affective and Dynamic Beam Search for Story Generation [50.3130767805383]
We propose Affective Story Generator (AffGen) for generating interesting narratives.
AffGen employs two novel techniques: Dynamic Beam Sizing and Affective Reranking.
arXiv Detail & Related papers (2023-10-23T16:37:14Z) - Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z) - Conflicts, Villains, Resolutions: Towards models of Narrative Media Framing [19.589945994234075]
We revisit a widely used conceptualization of framing from the communication sciences which explicitly captures elements of narratives.
We adapt an effective annotation paradigm that breaks a complex annotation task into a series of simpler binary questions.
We explore automatic multi-label prediction of our frames with supervised and semi-supervised approaches.
arXiv Detail & Related papers (2023-06-03T08:50:13Z) - M-SENSE: Modeling Narrative Structure in Short Personal Narratives Using Protagonist's Mental Representations [14.64546899992196]
We propose the task of automatically detecting prominent elements of the narrative structure by analyzing the role of characters' inferred mental state.
We introduce a STORIES dataset of short personal narratives containing manual annotations of key elements of narrative structure, specifically climax and resolution.
Our model is able to achieve significant improvements in the task of identifying climax and resolution.
arXiv Detail & Related papers (2023-02-18T20:48:02Z) - Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z) - Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [57.9843300852526]
We introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions.
To identify the possible weaknesses that adversaries can exploit, we create a NeuralNews dataset composed of 4 different types of generated articles.
In addition to the valuable insights gleaned from our user study experiments, we provide a relatively effective approach based on detecting visual-semantic inconsistencies.
arXiv Detail & Related papers (2020-09-16T14:13:15Z) - Once Upon A Time In Visualization: Understanding the Use of Textual Narratives for Causality [21.67542584041709]
Causality visualization can help people understand temporal chains of events.
But as the scale and complexity of these event sequences grows, even these visualizations can become overwhelming to use.
We propose the use of textual narratives as a data-driven storytelling method to augment causality visualization.
arXiv Detail & Related papers (2020-09-06T05:46:24Z)