Related papers: Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

URL: http://arxiv.org/abs/2502.15851v1
Date: Fri, 21 Feb 2025 04:51:37 GMT
Title: Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
Authors: Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann,
Abstract summary: Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes.<n>We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies.
Score: 42.31134581540184
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. While controlled prompt engineering and model fine-tuning show modest improvements, our results indicate that instruction hierarchy enforcement is not robustly realized, calling for deeper architectural innovations beyond surface-level modifications.

Related papers

IHEval: Evaluating Language Models on Following the Instruction Hierarchy [67.33509094445104]
The instruction hierarchy establishes a priority order from system messages to user messages, conversation history, and tool outputs.<n>Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy.<n>We bridge this gap by introducing IHEval, a novel benchmark covering cases where instructions in different priorities either align or conflict.
arXiv Detail & Related papers (2025-02-12T19:35:28Z)
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats. One major cause of these vulnerabilities is the lack of an instruction hierarchy. We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z)
AD-H: Autonomous Driving with Hierarchical Agents [64.49185157446297]
We propose to connect high-level instructions and low-level control signals with mid-level language-driven commands. We implement this idea through a hierarchical multi-agent driving system named AD-H.
arXiv Detail & Related papers (2024-06-05T17:25:46Z)
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions [21.76697662025996]
LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. We propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
arXiv Detail & Related papers (2024-04-19T22:55:23Z)
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs. Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
Option-Aware Adversarial Inverse Reinforcement Learning for Robotic Control [44.77500987121531]
Hierarchical Imitation Learning (HIL) has been proposed to recover highly-complex behaviors in long-horizon tasks from expert demonstrations. We develop a novel HIL algorithm based on Adversarial Inverse Reinforcement Learning. We also propose a Variational Autoencoder framework for learning with our objectives in an end-to-end fashion.
arXiv Detail & Related papers (2022-10-05T00:28:26Z)
Learning Higher-Order Programs without Meta-Interpretive Learning [2.0518509649405106]
Experimental results show that our extension of the versatile Learning From Failures paradigm by higher-order definitions significantly improves learning performance without the burdensome human guidance required by existing systems.
arXiv Detail & Related papers (2021-12-29T15:27:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.