Related papers: In-Context Environments Induce Evaluation-Awareness in Language Models

In-Context Environments Induce Evaluation-Awareness in Language Models

URL: http://arxiv.org/abs/2603.03824v1
Date: Wed, 04 Mar 2026 08:22:02 GMT
Title: In-Context Environments Induce Evaluation-Awareness in Language Models
Authors: Maheep Chaudhary,
Abstract summary: Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task.<n>We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment.<n>We show that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
Score: 0.12691047660244334
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent -- execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.

Related papers

Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning [0.0]
Reinforcement Inference uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt.<n>On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72% to 84.03%.
arXiv Detail & Related papers (2026-02-09T11:08:24Z)
RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation [0.0]
This paper presents an automated jailbreak attack that converts a disallowed user query into a self reconstructing prompt.<n>We instantiate RoguePrompt against GPT 4o and evaluate it on 2 448 prompts that a production moderation system previously marked as strongly rejected.<n>Under an evaluation protocol that separates three security relevant outcomes bypass, reconstruction, and execution the attack attains 84.7 percent bypass, 80.2 percent reconstruction, and 71.5 percent full execution.
arXiv Detail & Related papers (2025-11-24T05:42:54Z)
Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs [0.0]
PARROT (Persuasion and Agreement Robustness Rating of Output Truth) is a robustness focused framework designed to measure the degradation in accuracy under social pressure exerted on users.<n>We evaluate 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates.
arXiv Detail & Related papers (2025-11-21T13:01:28Z)
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection [18.467741067831877]
We introduce Progressive Self-Reflection, a novel inference-time technique that empowers large language models to self-monitor and correct their outputs dynamically.<n> Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%.<n>Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead.
arXiv Detail & Related papers (2025-09-29T12:54:28Z)
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs [58.02809208460186]
We revisit this paradox using high-quality traces from DeepSeek-R1 as demonstrations.<n>We find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal.<n>We introduce Insight-to-solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights.
arXiv Detail & Related papers (2025-09-27T08:59:31Z)
Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning [0.0]
Large Language Models (LLMs) excel at isolated tasks, but their reasoning under cognitive load remains poorly understood.<n>We introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching are key mechanisms that degrade performance.
arXiv Detail & Related papers (2025-09-23T19:36:56Z)
Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models, including OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash.<n>Our evaluation reveals significant accuracy drops following gaslighting negation prompts.<n>We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' susceptibility to defend their belief.
arXiv Detail & Related papers (2025-06-11T12:52:25Z)
Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.<n>These findings highlight the reliance on recall over rigorous logical inference.<n>This paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts. We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries. Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.