Beyond One World: Benchmarking Superheroes in Role-Playing Across Multiversal Contexts
- URL: http://arxiv.org/abs/2510.14351v2
- Date: Sat, 18 Oct 2025 07:29:23 GMT
- Title: Beyond One World: Benchmarking Superheroes in Role-Playing Across Multiversal Contexts
- Authors: Perapard Ngokpol, Kun Kerdthaisong, Pasin Buakhaw, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
- Abstract summary: We introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. We score responses for canonical accuracy and reasoning fidelity. We propose Think-Act Matching, a metric that quantifies alignment between reasons and actions.
- Score: 2.2816872489992135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting"). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.
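The alignment-between-reasons-and-actions idea behind Think-Act Matching can be illustrated with a toy scorer. This is a hypothetical sketch only: the paper's actual metric is not specified here, and the lexical-overlap heuristic, function name, and example texts below are all illustrative assumptions.

```python
# Toy sketch of a Think-Act Matching style score (hypothetical; the
# paper's actual scoring procedure may differ). A response is split into
# internal deliberation ("thinking") and an outward decision ("acting"),
# and the score rewards actions whose content is supported by the stated
# reasons. "Support" is approximated here with simple lexical overlap.

def think_act_matching(thinking: str, acting: str) -> float:
    """Return a score in [0, 1]: the fraction of content words in the
    action segment that also appear in the thinking segment."""
    stopwords = {"the", "a", "an", "i", "to", "and", "of", "is", "it", "be"}
    think_words = {w.lower().strip(".,!?") for w in thinking.split()} - stopwords
    act_words = {w.lower().strip(".,!?") for w in acting.split()} - stopwords
    if not act_words:
        return 0.0
    return len(act_words & think_words) / len(act_words)

# Example: the action mentions "civilians" and "villain", both grounded
# in the reasoning, while "evacuate", "before", "chasing" are not.
thinking = "Saving civilians must come first; the villain can be pursued later."
acting = "I evacuate the civilians before chasing the villain."
score = think_act_matching(thinking, acting)  # 2 of 5 content words match
```

A production metric would more plausibly use entailment classification or embedding similarity rather than word overlap; the sketch only shows the shape of the computation, i.e. a scalar alignment score between the two segments of one response.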
Related papers
- DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning [24.808926786222376]
We present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs.
arXiv Detail & Related papers (2026-01-29T13:43:17Z)
- Computational Representations of Character Significance in Novels [10.538161193756666]
We present a new literary theory proposing a six-component structural model of character. This model accounts for the narrator-character distinction and includes a component neglected by prior methods: discussion by other characters. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens.
arXiv Detail & Related papers (2026-01-21T22:29:41Z)
- Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation [62.54606886226136]
Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods. We introduce PersonaWeaver, a framework that disentangles world-building from behavioral-building.
arXiv Detail & Related papers (2026-01-06T20:18:01Z)
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains [69.0500092126915]
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. We introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases.
arXiv Detail & Related papers (2025-11-07T03:50:52Z)
- MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z)
- MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models? [43.58975298601617]
MotiveBench consists of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation. We conduct experiments on seven popular model families, comparing different scales and versions within each family. The results show that even the most advanced LLMs still fall short in achieving human-like motivational reasoning.
arXiv Detail & Related papers (2025-06-16T03:18:28Z)
- Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding [97.05584099530226]
We introduce MF$2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies.<n>For each pair, models must correctly identify both the true and false claims.<n>Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance.
arXiv Detail & Related papers (2025-06-06T17:58:36Z)
- Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents [48.52216655094884]
Internal thinking processes of role-playing language agents (RPLAs) remain unexplored. We introduce ROLETHINK, a novel benchmark constructed from literature for evaluating character thought generation. We propose MIRROR, a chain-of-thought approach that generates character thoughts by retrieving memories, predicting character reactions, and synthesizing motivations.
arXiv Detail & Related papers (2025-03-11T08:57:07Z)
- CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds [74.02480671181685]
Role-playing is a crucial capability of Large Language Models (LLMs). Current evaluation methods fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing. We propose CharacterBox, a simulation sandbox designed to generate situational fine-grained character behavior trajectories.
arXiv Detail & Related papers (2024-12-07T12:09:35Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D).
T4D requires models to connect inferences about others' mental states to actions in social scenarios.
We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
- Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets.
This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.