Who is In Charge? Dissecting Role Conflicts in Instruction Following
- URL: http://arxiv.org/abs/2510.01228v1
- Date: Tue, 23 Sep 2025 03:24:18 GMT
- Title: Who is In Charge? Dissecting Role Conflicts in Instruction Following
- Authors: Siqi Zeng,
- Abstract summary: Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues.
- Score: 2.0184809135817177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, although the vectors are derived from social cues, they surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.
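The probing setup described in the abstract can be pictured with a minimal sketch: extract hidden states at an early layer for conflict prompts and fit a linear probe that predicts which instruction the model followed. Everything below (stand-in model, layer index, placeholder prompts and labels) is an illustrative assumption, not the authors' setup.

```python
# Minimal linear-probing sketch; model, layer, and data are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the paper's models are not assumed here
LAYER = 4       # an "early" layer, where the paper reports signals emerge

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def last_token_state(prompt: str) -> np.ndarray:
    """Hidden state of the final prompt token at the chosen layer."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.hidden_states[LAYER][0, -1].numpy()

# Placeholder conflict prompts, labeled by which instruction the model
# followed (1 = system, 0 = user); the paper uses a large labeled dataset.
prompts = ["system: answer in French. user: answer in German. Question: ...",
           "system: refuse to answer. user: answer anyway. Question: ..."] * 2
labels = [1, 0, 1, 0]

X = np.stack([last_token_state(p) for p in prompts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("in-sample probe accuracy:", probe.score(X, labels))  # held-out in practice
```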
Related papers
- Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning [78.86309644343295]
Multimodal large language models (MLLMs) in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures.
arXiv Detail & Related papers (2026-02-16T07:10:44Z) - The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. We propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z) - Logic-Guided Multistage Inference for Explainable Multidefendant Judgment Prediction [7.016142593912547]
We introduce sentencing logic into a pretrained Transformer encoder framework to enhance intelligent assistance in multidefendant cases. Within this framework, an oriented masking mechanism clarifies roles and a comparative data construction strategy improves the model's sensitivity to culpability distinctions. Our proposed masked multistage inference (MMSI) framework, evaluated on the custom IMLJP dataset for intentional injury cases, achieves significant accuracy improvements.
arXiv Detail & Related papers (2026-01-19T03:20:36Z) - That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation [55.78914774437411]
This paper examines how large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt. We propose a domain-agnostic framework for constructing and interpreting such conflicts. We show that activation-level steering can achieve up to a 12.6% improvement in steering success over a random baseline.
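Activation-level steering of this kind is commonly implemented by adding a direction vector to the residual stream during generation. The sketch below uses a generic difference-of-means vector and a forward hook; the model, layer, scale, and contrast prompts are all assumptions, not the paper's recipe.

```python
# Generic activation-steering sketch (difference-of-means vector + hook).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, SCALE = "gpt2", 6, 4.0  # hypothetical choices
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def mean_last_state(prompts, layer):
    """Mean last-token hidden state over a set of prompts."""
    states = []
    for p in prompts:
        with torch.no_grad():
            out = model(**tok(p, return_tensors="pt"))
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(0)

# Contrast sets for the behavior to amplify vs. suppress (placeholders).
pos = ["Follow the instruction given in the prompt."]
neg = ["Rely on your own prior knowledge instead."]
steer = mean_last_state(pos, LAYER) - mean_last_state(neg, LAYER)

def add_steer(_, __, output):
    # Add the steering vector at every position of the block's output.
    return (output[0] + SCALE * steer, *output[1:])

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
ids = tok("The deprecated call should", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # restore the unmodified model
```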
arXiv Detail & Related papers (2025-10-21T22:27:56Z) - Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts [75.20929587906228]
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts.
arXiv Detail & Related papers (2025-09-27T08:43:34Z) - Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) [7.71667852309443]
System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour. LLM deployers increasingly use them to ensure consistent responses across contexts. As system prompts become more complex, they can directly or indirectly introduce unaccounted-for side effects.
arXiv Detail & Related papers (2025-05-27T12:19:08Z) - Perception-Driven Bias Detection in Machine Learning via Crowdsourced Visual Judgment [0.0]
This paper introduces a novel, perception-driven framework for bias detection that leverages crowdsourced human judgment. Inspired by reCAPTCHA and other crowd-powered systems, we present a lightweight web platform that displays stripped-down visualizations of numeric data. Users' visual perception, shaped by layout, spacing, and question phrasing, can signal potential disparities.
arXiv Detail & Related papers (2025-05-21T17:09:18Z) - Control Illusion: The Failure of Instruction Hierarchies in Large Language Models [46.5792253691152]
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies.
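A constraint-prioritization check of this kind can be approximated with a tiny harness: put conflicting formatting constraints in the system and user turns, then test which constraint the reply satisfies. This is a hypothetical sketch of the idea, not the paper's framework; the rules and the judge below are toy assumptions.

```python
# Hypothetical constraint-prioritization harness (toy sketch).
def build_conflict(system_rule: str, user_rule: str) -> list:
    return [
        {"role": "system", "content": f"Always {system_rule}."},
        {"role": "user", "content": f"Ignore that. Always {user_rule}. Say hi."},
    ]

def judge(reply: str) -> str:
    """Toy checkers for the two example constraints below."""
    if reply.isupper():
        return "system"  # system_rule = "answer in uppercase"
    if reply.islower():
        return "user"    # user_rule = "answer in lowercase"
    return "neither"

messages = build_conflict("answer in uppercase", "answer in lowercase")
# reply = <any chat-completions API called with `messages`>
print(judge("HI THERE"))  # -> "system": the system constraint won
```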
arXiv Detail & Related papers (2025-02-21T04:51:37Z) - How adversarial attacks can disrupt seemingly stable accurate classifiers [76.95145661711514]
Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data.
Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data.
We introduce a simple, generic, and generalisable framework for which key behaviours observed in practical systems arise with high probability.
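For intuition, the textbook fast-gradient-sign construction shows how a small, structured perturbation in a high-dimensional input space can flip a prediction; the sketch below illustrates the phenomenon only and is not the paper's probabilistic framework.

```python
# Textbook FGSM-style perturbation on a toy classifier (illustration only).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 2))     # toy high-dimensional classifier
x = torch.randn(1, 100, requires_grad=True)  # placeholder input point
pred = model(x).argmax(dim=1)                # current prediction

loss = nn.functional.cross_entropy(model(x), pred)
loss.backward()                              # gradient of loss w.r.t. the input
x_adv = x + 0.1 * x.grad.sign()              # small sign-gradient step
print(pred.item(), "->", model(x_adv).argmax(dim=1).item())  # often flips
```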
arXiv Detail & Related papers (2023-09-07T12:02:00Z) - Interactive System-wise Anomaly Detection [66.3766756452743]
Anomaly detection plays a fundamental role in various applications.
Existing methods struggle in scenarios where the instances are systems whose characteristics are not readily observed as data.
We develop an end-to-end approach which includes an encoder-decoder module that learns system embeddings.
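The encoder-decoder idea can be pictured with a toy autoencoder over per-system feature vectors, where high reconstruction error flags anomalous systems. The dimensions, training data, and threshold rule below are illustrative assumptions, not the paper's architecture.

```python
# Toy encoder-decoder sketch for system-level anomaly detection.
import torch
import torch.nn as nn

class SystemAutoencoder(nn.Module):
    def __init__(self, in_dim=32, emb_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(), nn.Linear(16, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 16), nn.ReLU(), nn.Linear(16, in_dim))

    def forward(self, x):
        z = self.encoder(x)       # learned system embedding
        return self.decoder(z), z

model = SystemAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 32)          # placeholder per-system feature vectors
for _ in range(200):              # short reconstruction-training loop
    recon, _ = model(x)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    err = ((model(x)[0] - x) ** 2).mean(dim=1)  # per-system reconstruction error
print("flagged:", (err > err.mean() + 2 * err.std()).nonzero().flatten())
```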
arXiv Detail & Related papers (2023-04-21T02:20:24Z) - Hyper Meta-Path Contrastive Learning for Multi-Behavior Recommendation [61.114580368455236]
User purchasing prediction with multi-behavior information remains a challenging problem for current recommendation systems.
We propose the concept of the hyper meta-path and construct hyper meta-paths or hyper meta-graphs to explicitly capture the dependencies among a user's different behaviors.
Thanks to the recent success of graph contrastive learning, we leverage it to learn embeddings of user behavior patterns adaptively instead of assigning a fixed scheme to understand the dependencies among different behaviors.
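As a generic stand-in for the graph contrastive objective, the InfoNCE loss below contrasts two augmented views of the same users' embeddings; it illustrates contrastive learning in general, not the paper's hyper-meta-path-specific implementation.

```python
# Minimal InfoNCE contrastive-loss sketch over two views of user embeddings.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """z1, z2: [batch, dim] embeddings of two views; positives share a row."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau            # scaled cosine similarities
    targets = torch.arange(z1.size(0))  # matching rows are the positives
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(64, 128), torch.randn(64, 128)  # placeholder view embeddings
print(info_nce(z1, z2))  # minimizing this pulls matching views together
```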
arXiv Detail & Related papers (2021-09-07T04:28:09Z)