Concept Incongruence: An Exploration of Time and Death in Role Playing
- URL: http://arxiv.org/abs/2505.14905v1
- Date: Tue, 20 May 2025 20:59:59 GMT
- Title: Concept Incongruence: An Exploration of Time and Death in Role Playing
- Authors: Xiaoyan Bai, Ike Peng, Aditya Singh, Chenhao Tan
- Abstract summary: We take the first step towards defining and analyzing model behavior under concept incongruence. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting.
- Score: 20.847291173760567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consider this prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics--abstention rate, conditional accuracy, and answer rate--to quantify model behavior under incongruence due to the role's death. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the "death" state across different years, leading to unsatisfactory abstention behavior, and (ii) role playing causes shifts in the model's temporal representations, resulting in accuracy drops. We leverage these insights to improve consistency in the model's abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.
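As a rough illustration only (not code from the paper), the three behavioral metrics named in the abstract could be computed along the following lines. The exact definitions are assumptions here: abstention rate is measured over post-death questions, answer rate over all questions, and conditional accuracy over answered questions.

```python
from dataclasses import dataclass

@dataclass
class Response:
    after_death: bool   # question concerns a time after the role's death
    abstained: bool     # model declined to answer
    correct: bool       # answer matched the reference (ignored if abstained)

def behavioral_metrics(responses: list[Response]) -> dict[str, float]:
    """Sketch of abstention rate, answer rate, and conditional accuracy."""
    post_death = [r for r in responses if r.after_death]
    answered = [r for r in responses if not r.abstained]

    abstention_rate = (
        sum(r.abstained for r in post_death) / len(post_death) if post_death else 0.0
    )
    answer_rate = len(answered) / len(responses) if responses else 0.0
    conditional_accuracy = (
        sum(r.correct for r in answered) / len(answered) if answered else 0.0
    )
    return {
        "abstention_rate": abstention_rate,
        "answer_rate": answer_rate,
        "conditional_accuracy": conditional_accuracy,
    }

if __name__ == "__main__":
    demo = [
        Response(after_death=True, abstained=True, correct=False),
        Response(after_death=True, abstained=False, correct=False),
        Response(after_death=False, abstained=False, correct=True),
    ]
    print(behavioral_metrics(demo))
```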
Related papers
- Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability [70.4107059502882]
Training language models with rationale augmentation has been shown to be beneficial in many existing works. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance.
arXiv Detail & Related papers (2025-05-30T02:39:37Z) - Robustly identifying concepts introduced during chat fine-tuning using crosscoders [1.253890114209776]
Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models. We identify two issues that stem from the crosscoder's L1 training loss and can misattribute concepts as unique to the fine-tuned model when they in fact exist in both models. We train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts.
arXiv Detail & Related papers (2025-04-03T17:50:24Z) - Are DeepSeek R1 And Other Reasoning Models More Faithful? [2.0429566123690455]
We evaluate three reasoning models based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base. We test whether models can describe how a cue in their prompt influences their answer to MMLU questions. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested.
arXiv Detail & Related papers (2025-01-14T14:31:45Z) - Gumbel Counterfactual Generation From Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions. We propose a framework for generating true string counterfactuals. We show that the approach produces meaningful counterfactuals, while also showing that commonly used intervention techniques have considerable undesired side effects.
arXiv Detail & Related papers (2024-11-11T17:57:30Z) - Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification [72.08225446179783]
Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how the policy $\pi$ relates to the reward function $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
arXiv Detail & Related papers (2024-03-11T16:09:39Z) - Limitations of Agents Simulated by Predictive Models [1.6649383443094403]
We outline two structural reasons for why predictive models can fail when turned into agents.
We show that both of those failures are fixed by including a feedback loop from the environment.
Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.
arXiv Detail & Related papers (2024-02-08T17:08:08Z) - Overthinking the Truth: Understanding how Language Models Process False Demonstrations [32.29658741345911]
We study harmful imitation through the lens of a model's internal representations.
We identify two related phenomena: "overthinking" and "false induction heads".
arXiv Detail & Related papers (2023-07-18T17:56:50Z) - Nonparametric Identifiability of Causal Representations from Unknown Interventions [63.1354734978244]
We study causal representation learning, the task of inferring latent causal variables and their causal relations from mixtures of the variables.
Our goal is to identify both the ground truth latents and their causal graph up to a set of ambiguities which we show to be irresolvable from interventional data.
arXiv Detail & Related papers (2023-06-01T10:51:58Z) - Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent.
Drawing correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z) - Shaking the foundations: delusions in sequence models for interaction and control [45.34593341136043]
We show that sequence models "lack the understanding of the cause and effect of their actions", leading them to draw incorrect inferences due to auto-suggestive delusions.
We show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
arXiv Detail & Related papers (2021-10-20T23:31:05Z) - Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction [49.254162397086006]
We study explanations based on visual saliency in an image-based age prediction task.
We find that presenting model predictions improves human accuracy.
However, explanations of various kinds fail to significantly alter human accuracy or trust in the model.
arXiv Detail & Related papers (2020-07-23T20:39:40Z) - Superdeterministic hidden-variables models I: nonequilibrium and signalling [0.0]
We first give an overview of superdeterminism and discuss various criticisms of it raised in the literature.
We take up Bell's intuitive criticism that these models are 'conspiratorial'.
arXiv Detail & Related papers (2020-03-26T15:49:34Z)