Related papers: Reasoning Up the Instruction Ladder for Controllable Language Models

Reasoning Up the Instruction Ladder for Controllable Language Models

URL: http://arxiv.org/abs/2511.04694v2
Date: Wed, 12 Nov 2025 01:19:01 GMT
Title: Reasoning Up the Instruction Ladder for Controllable Language Models
Authors: Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar,
Abstract summary: Large language model (LLM) based systems take on high-stakes roles in real-world decision-making.<n>Enforcement of an instruction hierarchy (IH) in LLMs is critical for the reliability and controllability of LLMs.<n>In this work, we reframe instruction hierarchy resolution as a reasoning task.
Score: 26.068755167791505
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

Related papers

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models [46.5792253691152]
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes.<n>We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies.
arXiv Detail & Related papers (2025-02-21T04:51:37Z)
IHEval: Evaluating Language Models on Following the Instruction Hierarchy [67.33509094445104]
The instruction hierarchy establishes a priority order from system messages to user messages, conversation history, and tool outputs.<n>Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy.<n>We bridge this gap by introducing IHEval, a novel benchmark covering cases where instructions in different priorities either align or conflict.
arXiv Detail & Related papers (2025-02-12T19:35:28Z)
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.<n>One major cause of these vulnerabilities is the lack of an instruction hierarchy.<n>We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z)
RNR: Teaching Large Language Models to Follow Roles and Rules [153.6596303205894]
We propose model, an automated data generation pipeline that generates diverse roles and rules from existing IFT instructions. This data can then be used to train models that follow complex system prompts. Our framework significantly improves role and rule following capability in large language models.
arXiv Detail & Related papers (2024-09-10T06:07:32Z)
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions [21.76697662025996]
LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. We propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
arXiv Detail & Related papers (2024-04-19T22:55:23Z)
Nevermind: Instruction Override and Moderation in Large Language Models [2.0935496890864207]
We investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations. We observe improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines.
arXiv Detail & Related papers (2024-02-05T18:58:19Z)
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.