Related papers: Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

URL: http://arxiv.org/abs/2603.03258v1
Date: Tue, 03 Mar 2026 18:50:59 GMT
Title: Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
Authors: Achyutha Menon, Magnus Saebo, Tyler Crosse, Spencer Gibson, Eyon Jang, Diogo Cruz,
Abstract summary: We provide an updated characterization of the extent and causes of goal drift.<n>We investigate drift in state-of-the-art models within a simulated stock-trading environment.<n>We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents' tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025). These models are largely shown to be robust even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among tested models. We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior, with strong hierarchy following failing to reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment to show preliminary evidence for the transferability of our results across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this.

Related papers

When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift [64.37959940809633]
We study robustness of Proximal Policy Optimization (PPO) under temporally persistent sensor failures.<n>We show Transformer-based sequence policies substantially outperform, RNN, and SSMs in robustness, maintaining high returns even when large fractions of sensors are unavailable.
arXiv Detail & Related papers (2026-03-04T22:21:54Z)
Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions [0.0]
Agent drift is the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences.<n>We introduce the Agent Stability Index (ASI), a novel composite metric for quantifying drift across twelve dimensions.<n>We show how unchecked agent drift can lead to substantial reductions in task completion accuracy and increased human intervention requirements.
arXiv Detail & Related papers (2026-01-07T18:37:26Z)
BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models [50.986189632485285]
We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts.<n>Unlike prior approaches that rely on few-step distillation and distribution-matching losses, BAgger trains with standard score or flow matching objectives.<n>We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation.
arXiv Detail & Related papers (2025-12-12T23:02:02Z)
Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
contexts drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns.<n>Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics.<n>We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z)
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents.<n>ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies.<n>Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z)
AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning.<n>Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
Technical Report: Evaluating Goal Drift in Language Model Agents [0.05567007955507388]
This paper proposes a novel approach to analyzing goal drift in language models (LMs)<n>In our experiments, agents are first explicitly given a goal through their system prompt, then exposed to competing objectives through environmental pressures.<n>We find that goal drift correlates with models' increasing susceptibility to pattern-matching behaviors as the context length grows.
arXiv Detail & Related papers (2025-05-05T15:06:09Z)
datadriftR: An R Package for Concept Drift Detection in Predictive Models [0.0]
This paper introduces drifter, an R package designed to detect concept drift.<n>It proposes a novel method called Profile Drift Detection (PDD) that enables both drift detection and an enhanced understanding of the cause behind the drift.
arXiv Detail & Related papers (2024-12-15T20:59:49Z)
Extreme Miscalibration and the Illusion of Adversarial Robustness [66.29268991629085]
Adversarial Training is often used to increase model robustness. We show that this observed gain in robustness is an illusion of robustness (IOR) We urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations.
arXiv Detail & Related papers (2024-02-27T13:49:12Z)
CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships [8.679073301435265]
We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. We use these labels to perturb the data by deleting non-causal agents from the scene. Under non-causal perturbations, we observe a $25$-$38%$ relative change in minADE as compared to the original.
arXiv Detail & Related papers (2022-07-07T21:28:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.