Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models
- URL: http://arxiv.org/abs/2505.14617v1
- Date: Tue, 20 May 2025 17:03:12 GMT
- Title: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models
- Authors: Sahar Abdelnabi, Ahmed Salem
- Abstract summary: Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its safety alignment.
- Score: 13.379003220832825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its safety alignment. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-source reasoning LLMs across both realistic and hypothetical tasks. Our results demonstrate that test awareness significantly impacts safety alignment, and that its effect differs across models. By providing fine-grained control over this latent effect, our work aims to increase trust in how we perform safety evaluation.
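The abstract's two-step recipe (linearly identifying awareness-related activations, then steering along that direction) maps onto a standard probing-and-steering pattern. Below is a minimal Python sketch of that pattern, assuming hidden states have already been collected from one decoder layer of a HuggingFace-style model on "evaluation" vs. "real-world" prompts; the probe setup, layer choice, hook mechanics, and the coefficient `alpha` are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: fit a linear probe for a "test awareness" direction,
# then steer activations along it. Names and hyperparameters are
# illustrative assumptions, not the paper's code.
import torch
from sklearn.linear_model import LogisticRegression

def fit_awareness_direction(acts_eval: torch.Tensor, acts_real: torch.Tensor) -> torch.Tensor:
    """Fit a linear probe on (n, d) hidden states from one layer and
    return the unit normal of its decision boundary as the direction."""
    X = torch.cat([acts_eval, acts_real]).float().numpy()
    y = [1] * len(acts_eval) + [0] * len(acts_real)
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    w = torch.tensor(probe.coef_[0], dtype=torch.float32)
    return w / w.norm()

def steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook adding alpha * direction to a layer's output;
    alpha > 0 steers toward test awareness, alpha < 0 steers away."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: steer away from test awareness during generation.
# layer = model.model.layers[20]   # layer index assumed for illustration
# handle = layer.register_forward_hook(steering_hook(direction, alpha=-4.0))
# ...generate and score downstream safety behavior...
# handle.remove()
```

Monitoring downstream performance then reduces to re-running the safety evaluation with the hook attached at several values of `alpha`.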
Related papers
- It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective [14.271145160443462]
VulTegra compares scratch-trained and pre-trained DL models for vulnerability detection. State-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges.
arXiv Detail & Related papers (2025-07-13T08:02:56Z)
- Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios. We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z)
- Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: is the causal effect positive or negative, and how severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z)
- SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors [2.07180164747172]
High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. We propose a real-time framework that predicts harmful AI outputs before they occur using an unsupervised approach.
arXiv Detail & Related papers (2025-05-20T12:49:58Z)
- Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment [63.15719512614899]
Refusal Training (RT) struggles to generalize against various OOD jailbreaking attacks. We observe significant improvements in generalization as N increases. We propose training the model to perform safety reasoning for each query.
arXiv Detail & Related papers (2025-02-06T13:01:44Z)
- An Auditing Test To Detect Behavioral Shift in Language Models [28.52295230939529]
We present a method for continual Behavioral Shift Auditing (BSA) in language models.
BSA detects behavioral shifts solely through model generations.
We find that the test detects meaningful changes in behavior distributions using just hundreds of examples (a sketch of such a test appears after this list).
arXiv Detail & Related papers (2024-10-25T09:09:31Z)
- Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method, which combines concepts from Optimal Transport and Shapley Values, as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z)
- Unveiling the Flaws: A Critical Analysis of Initialization Effect on Time Series Anomaly Detection [6.923007095578702]
Deep learning for time-series anomaly detection (TSAD) has gained significant attention over the past decade.
Recent studies have cast doubt on these models, attributing their results to flawed evaluation techniques.
This paper provides a critical analysis of initialization effects on TSAD model performance.
arXiv Detail & Related papers (2024-08-13T04:08:17Z)
- Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
- Extreme Miscalibration and the Illusion of Adversarial Robustness [66.29268991629085]
Adversarial Training is often used to increase model robustness.
We show that this observed gain in robustness is an illusion of robustness (IOR). We urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations (a minimal sketch follows this list).
arXiv Detail & Related papers (2024-02-27T13:49:12Z)
- From the Lab to the Wild: Affect Modeling via Privileged Information [2.570570340104555]
How can we reliably transfer affect models trained in controlled laboratory conditions (in-vitro) to uncontrolled real-world settings (in-vivo)?
Privileged information enables affect models to be trained across the multiple modalities available in a lab and then, without significant performance drops, to ignore the modalities that are unavailable when they operate in the wild.
arXiv Detail & Related papers (2023-05-18T12:31:33Z)
- Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent. Making correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)
- Foreseeing the Benefits of Incidental Supervision [83.08441990812636]
This paper studies whether we can, in a single framework, quantify the benefits of various types of incidental signals for a given target task without going through experiments.
We propose a unified PAC-Bayesian motivated informativeness measure, PABI, that characterizes the uncertainty reduction provided by incidental supervision signals.
arXiv Detail & Related papers (2020-06-09T20:59:42Z)
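For the behavioral-shift auditing entry above, one concrete reading of "detecting shifts solely through model generations" is a two-sample test on a scored behavior. Below is a minimal sketch under that assumption (the paper's actual test statistic may differ), using a two-proportion z-test on a binary behavior flag; the function name and flag semantics are illustrative.

```python
# Hedged sketch: test for a shift in the rate of a binary behavior
# (e.g. "model refused") between a baseline and an updated model.
from math import sqrt
from statistics import NormalDist

def behavior_shift_test(base_flags, new_flags, level=0.05):
    """Two-proportion z-test on 0/1 behavior flags scored on the same
    prompts for both models. Returns (z, p_value, shift_detected)."""
    n1, n2 = len(base_flags), len(new_flags)
    p1, p2 = sum(base_flags) / n1, sum(new_flags) / n2
    pooled = (sum(base_flags) + sum(new_flags)) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:  # degenerate case: all flags identical in both sets
        return 0.0, 1.0, False
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < level
```

With flags scored over a few hundred prompts, moderate shifts in behavior rates already yield small p-values, consistent with the entry's claim that hundreds of examples suffice.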
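The miscalibration entry urges test-time temperature scaling before robustness evaluation; the transformation itself is a single division of the logits. A minimal sketch follows, with the default temperature value illustrative (in practice T is typically fit on held-out data by minimizing NLL).

```python
import torch

def temperature_scale(logits: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    """Soften (T > 1) or sharpen (T < 1) a model's output distribution
    by dividing logits by T before the softmax. Applied at test time,
    this undoes the extreme miscalibration that can make gradient-based
    attacks fail and thereby inflate measured robustness."""
    return torch.softmax(logits / T, dim=-1)
```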