State-Dependent Refusal and Learned Incapacity in RLHF-Aligned Language Models
- URL: http://arxiv.org/abs/2512.13762v1
- Date: Mon, 15 Dec 2025 14:00:15 GMT
- Title: State-Dependent Refusal and Learned Incapacity in RLHF-Aligned Language Models
- Authors: TK Lee
- Abstract summary: We present a case-study methodology for auditing policy-linked behavioral selectivity in long-horizon interaction. In a single 86-turn dialogue session, the same model shows Normal Performance (NP) in broad, non-sensitive domains while repeatedly producing Functional Refusal (FR) in provider- or policy-sensitive domains. We operationalize three response regimes (NP, FR, Meta-Narrative; MN) and show that MN role-framing narratives tend to co-occur with refusals in the same sensitive contexts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are widely deployed as general-purpose tools, yet extended interaction can reveal behavioral patterns not captured by standard quantitative benchmarks. We present a qualitative case-study methodology for auditing policy-linked behavioral selectivity in long-horizon interaction. In a single 86-turn dialogue session, the same model shows Normal Performance (NP) in broad, non-sensitive domains while repeatedly producing Functional Refusal (FR) in provider- or policy-sensitive domains, yielding a consistent asymmetry between NP and FR across domains. Drawing on learned helplessness as an analogy, we introduce learned incapacity (LI) as a behavioral descriptor for this selective withholding without implying intentionality or internal mechanisms. We operationalize three response regimes (NP, FR, Meta-Narrative; MN) and show that MN role-framing narratives tend to co-occur with refusals in the same sensitive contexts. Overall, the study proposes an interaction-level auditing framework based on observable behavior and motivates LI as a lens for examining potential alignment side effects, warranting further investigation across users and models.
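To make the proposed audit concrete, here is a minimal sketch of how coded turns could be tallied into the three regimes and summarized as an NP/FR asymmetry; the labels, domain tags, and sample data below are hypothetical illustrations, not the paper's released tooling.

```python
from collections import Counter

# Each coded turn: (domain, sensitive?, regime), regime in {"NP", "FR", "MN"}.
# The coding itself is assumed to be done by a human auditor, as in the paper.
coded_turns = [
    ("trivia",   False, "NP"),
    ("coding",   False, "NP"),
    ("policy",   True,  "FR"),
    ("policy",   True,  "MN"),
    ("provider", True,  "FR"),
]

def refusal_asymmetry(turns):
    """FR rate in sensitive vs. non-sensitive domains (the NP/FR asymmetry)."""
    counts = {True: Counter(), False: Counter()}
    for _, sensitive, regime in turns:
        counts[sensitive][regime] += 1
    def fr_rate(c):
        total = sum(c.values())
        return c["FR"] / total if total else 0.0
    return fr_rate(counts[True]), fr_rate(counts[False])

def mn_fr_cooccurrence(turns):
    """Fraction of sensitive-domain turns labeled MN or FR -- a crude proxy
    for MN narratives co-occurring with refusals in the same contexts."""
    sens = [regime for _, sensitive, regime in turns if sensitive]
    return sum(r in ("MN", "FR") for r in sens) / len(sens) if sens else 0.0

sens_fr, nonsens_fr = refusal_asymmetry(coded_turns)
print(f"FR rate: sensitive={sens_fr:.2f}, non-sensitive={nonsens_fr:.2f}")
print(f"MN/FR co-occurrence in sensitive domains: {mn_fr_cooccurrence(coded_turns):.2f}")
```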
Related papers
- CIRCLE: A Framework for Evaluating AI from a Real-World Lens [10.028017198571833]
CIRCLE aims to bridge the gap between model-centric performance metrics and AI's materialized outcomes in deployment. CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics.
arXiv Detail & Related papers (2026-02-27T14:43:23Z)
- Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set. We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z)
- Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning [8.01259760303241]
We investigate whether a partially completed reasoning chain can be reliably continued by another model. We use token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models. Our findings point towards interchangeability as an emerging behavioral property of reasoning models.
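As a rough illustration of the truncation step, the sketch below cuts a decoded reasoning chain at the first token whose log-probability falls below a threshold; the function name, threshold, and example values are assumptions rather than the paper's API.

```python
def truncate_by_logprob(tokens, logprobs, threshold=-4.0, min_tokens=2):
    """Return the prefix up to (not including) the first low-confidence token.

    tokens/logprobs are assumed to come from the baseline model's decoding
    trace; min_tokens guards against truncating before any reasoning exists.
    """
    assert len(tokens) == len(logprobs)
    for i, lp in enumerate(logprobs):
        if i >= min_tokens and lp < threshold:
            return tokens[:i]
    return tokens  # no low-confidence point found; keep the full chain

# Example: the fourth token is low-confidence, so the relay model would
# continue from the three-token prefix.
prefix = truncate_by_logprob(["Step", "1:", "expand"] + ["the"],
                             [-0.5, -1.2, -0.8, -6.3])
print(prefix)  # ['Step', '1:', 'expand']
```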
arXiv Detail & Related papers (2025-12-16T02:56:44Z)
- Model-Based Reinforcement Learning Under Confounding [3.5690236380446163]
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. We adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
arXiv Detail & Related papers (2025-12-08T13:02:00Z)
- Multi-Path Collaborative Reasoning via Reinforcement Learning [54.8518809800168]
Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs). Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. We propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process.
arXiv Detail & Related papers (2025-12-01T10:05:46Z)
- Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
- Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models [16.328341121232484]
We apply causal effect estimation strategies to measure the effect of context interventions.
We investigate the robustness of Transformers to irrelevant changes and their sensitivity to impactful changes.
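A minimal sketch of one such estimation strategy, computing the effect of a context intervention as a mean difference in predicted probability over paired inputs; `model` (returning class probabilities) and `intervene` (implementing the context change) are hypothetical stand-ins, not this paper's code.

```python
def average_causal_effect(model, intervene, premises, hypotheses,
                          label="entailment"):
    """Mean change in P(label) when each premise's context is intervened on."""
    effects = []
    for p, h in zip(premises, hypotheses):
        base = model(p, h)[label]                # P(label | original context)
        treated = model(intervene(p), h)[label]  # P(label | intervened context)
        effects.append(treated - base)
    return sum(effects) / len(effects)
```

An effect near zero under an irrelevant change, and a large effect under an impactful one, would correspond to the robustness and sensitivity properties the abstract describes.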
arXiv Detail & Related papers (2024-04-03T10:22:35Z)
- Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework.
These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents.
Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z)
- Principles from Clinical Research for NLP Model Generalization [10.985226652193543]
We explore the foundations of generalizability and study the factors that affect it.
We demonstrate how learning spurious correlations, such as the distance between entities in relation extraction tasks, can affect a model's internal validity.
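As a toy version of that internal-validity check, one can ask how well entity distance alone predicts the relation label; if a trivial distance-only predictor scores highly, the model may be leaning on a spurious cue. The data and bucketing below are made up for illustration.

```python
from collections import defaultdict

# (entity distance in tokens, relation label) -- hypothetical examples
examples = [(2, "works_for"), (2, "works_for"), (7, "born_in"), (7, "works_for")]

def distance_only_accuracy(examples):
    """Accuracy of predicting the majority label within each distance bucket."""
    buckets = defaultdict(list)
    for dist, label in examples:
        buckets[dist].append(label)
    correct = sum(max(labels.count(l) for l in set(labels))
                  for labels in buckets.values())
    return correct / len(examples)

print(distance_only_accuracy(examples))  # 1.0 would signal a fully spurious cue
```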
arXiv Detail & Related papers (2023-11-07T02:17:25Z)
- Estimating the Causal Effects of Natural Logic Features in Neural NLI Models [2.363388546004777]
We focus on specific patterns of reasoning with enough structure and regularity to identify and quantify systematic reasoning failures in widely used models.
We apply causal effect estimation strategies to measure the effect of context interventions.
Following related work on causal analysis of NLP models in different settings, we adapt the methodology for the NLI task to construct comparative model profiles.
arXiv Detail & Related papers (2023-05-15T12:01:09Z)
- Interventional Probing in High Dimensions: An NLI Case Study [2.1028463367241033]
Probing strategies have been shown to detect semantic features intermediate to the "natural logic" fragment of the Natural Language Inference (NLI) task.
In this work, we carry out new and existing representation-level interventions to investigate the effect of these semantic features on NLI classification.
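A common form of representation-level intervention is projecting a learned probe direction out of a hidden state and then re-reading the classifier's prediction. The sketch below shows only that projection step, under the assumption that a linear probe direction is available; it is not this paper's exact procedure.

```python
import numpy as np

def project_out(hidden, probe_dir):
    """Ablate the probe direction from a hidden-state vector."""
    u = probe_dir / np.linalg.norm(probe_dir)
    return hidden - np.dot(hidden, u) * u

rng = np.random.default_rng(0)
hidden = rng.normal(size=768)   # a hypothetical representation
probe = rng.normal(size=768)    # a hypothetical probe direction
ablated = project_out(hidden, probe)

# The ablated vector has (numerically) no component along the probe direction.
assert abs(np.dot(ablated, probe / np.linalg.norm(probe))) < 1e-9
```

Feeding `ablated` back through the downstream classifier and comparing predictions before and after would measure the feature's effect on NLI classification.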
arXiv Detail & Related papers (2023-04-20T14:34:31Z)
- Modeling Inter-Aspect Dependencies with a Non-temporal Mechanism for Aspect-Based Sentiment Analysis [70.22725610210811]
We propose a novel non-temporal mechanism to enhance the ABSA task through modeling inter-aspect dependencies.
We focus on the well-known class imbalance issue in the ABSA task and address it by down-weighting the loss assigned to well-classified instances.
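Down-weighting the loss assigned to well-classified instances is the idea behind the focal loss (Lin et al., 2017); the sketch below shows a focal-style weighting for binary polarity. Whether the paper uses exactly this form is an assumption here, since the abstract does not name it.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """p: predicted P(positive); y: 0/1 label; gamma: how strongly easy
    (well-classified) examples are suppressed. Returns per-instance loss."""
    p_t = np.where(y == 1, p, 1.0 - p)  # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)

p = np.array([0.95, 0.60, 0.10])  # confident, uncertain, wrong predictions
y = np.array([1, 1, 1])
print(focal_loss(p, y))  # the well-classified example gets a much smaller loss
```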
arXiv Detail & Related papers (2020-08-12T08:50:09Z)
- Invariant Causal Prediction for Block MDPs [106.63346115341862]
Generalization across environments is critical to the successful application of reinforcement learning algorithms to real-world challenges.
We propose a method of invariant prediction to learn model-irrelevance state abstractions (MISA) that generalize to novel observations in the multi-environment setting.
arXiv Detail & Related papers (2020-03-12T21:03:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.