Related papers: The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

URL: http://arxiv.org/abs/2505.00626v1
Date: Thu, 01 May 2025 16:06:16 GMT
Title: The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)
Authors: Zihao Wang, Yibo Jiang, Jiahao Yu, Heqing Huang,
Abstract summary: We show that fine-tuned large language models often rely on two proxies for role identification. We propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding.
Score: 15.48684126686974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role, a concept we call role separation, is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
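The abstract does not spell out how position IDs are manipulated. As a purely illustrative sketch, one way such a token-wise cue could look is a numeric gap between the position IDs of system tokens and those of user tokens. The function name `build_position_ids`, the `gap` parameter, and the two-segment layout are assumptions made for this example and are not taken from the paper.

```python
def build_position_ids(num_system_tokens, num_user_tokens, gap=0):
    """Return position IDs for a [system tokens][user tokens] sequence.

    gap=0 gives the usual contiguous numbering 0, 1, 2, ...
    gap>0 shifts every user-token position by `gap`. This is a
    hypothetical knob for illustration, not the paper's stated method.
    """
    if min(num_system_tokens, num_user_tokens, gap) < 0:
        raise ValueError("counts and gap must be non-negative")
    system_ids = list(range(num_system_tokens))
    user_start = num_system_tokens + gap
    user_ids = list(range(user_start, user_start + num_user_tokens))
    return system_ids + user_ids


# Three system tokens followed by four user tokens.
default_ids = build_position_ids(3, 4)
shifted_ids = build_position_ids(3, 4, gap=10)
print(default_ids)  # [0, 1, 2, 3, 4, 5, 6]
print(shifted_ids)  # [0, 1, 2, 13, 14, 15, 16]
```

With the gap in place, the role boundary is visible in the position encoding itself, independent of the task type or of how close a token sits to the start of the text. Whether and how much this helps is an empirical question that the paper addresses.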

Related papers

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity-difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
How role-play shapes relevance judgment in zero-shot LLM rankers [15.11127856890218]
Large Language Models (LLMs) have emerged as promising zero-shot rankers. Their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings.
arXiv Detail & Related papers (2025-10-20T13:39:48Z)
RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing [26.263809408983306]
Large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, education, and governance. Inspired by Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a text retrieval problem.
arXiv Detail & Related papers (2025-09-15T17:31:15Z)
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models [7.115323364355489]
In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). We first show that Llama 3 70B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors.
arXiv Detail & Related papers (2024-10-02T06:51:12Z)
Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering [108.2131720470005]
Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. They often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful or hallucinated. We propose AutoPASTA, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM's attention scores.
arXiv Detail & Related papers (2024-09-16T23:52:41Z)
Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs [63.29737699997859]
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning. In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
arXiv Detail & Related papers (2024-05-26T21:31:59Z)
Less is more: Summarizing Patch Tokens for efficient Multi-Label Class-Incremental Learning [38.36863497458095]
We propose a new class-incremental learning method, Multi-Label class incremental learning via summarising pAtch tokeN Embeddings (MULTI-LANE). Our proposed method enables learning disentangled task-specific representations in MLCIL while ensuring fast inference.
arXiv Detail & Related papers (2024-05-24T15:18:27Z)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit controllable text generation. We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
Prompting and Fine-Tuning Open-Sourced Large Language Models for Stance Classification [1.6317061277457001]
Stance classification has long been a focal point of research in domains ranging from social science to machine learning. Current stance detection methods rely predominantly on manual annotation of sentences, followed by training a supervised machine learning model. We investigate the use of Large Language Models as a stance detection methodology that can reduce or even eliminate the need for manual annotations.
arXiv Detail & Related papers (2023-09-24T19:36:17Z)
What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning [24.395288160951118]
Large language models (LLMs) exploit in-context learning (ICL) to solve tasks with only a few demonstrations. We characterize two ways through which ICL leverages demonstrations. We show that models can achieve non-trivial performance with only task recognition (TR), and TR does not further improve with larger models or more demonstrations.
arXiv Detail & Related papers (2023-05-16T18:05:19Z)
RODE: Learning Roles to Decompose Multi-Agent Tasks [69.56458960841165]
Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. We propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. By virtue of these advances, our method outperforms the current state-of-the-art MARL algorithms on 10 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark.
arXiv Detail & Related papers (2020-10-04T09:20:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.