Detecting Strategic Deception Using Linear Probes
- URL: http://arxiv.org/abs/2502.03407v1
- Date: Wed, 05 Feb 2025 17:49:40 GMT
- Title: Detecting Strategic Deception Using Linear Probes
- Authors: Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
- Abstract summary: We evaluate if linear probes can robustly detect deception by monitoring model activations. We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at data.apolloresearch.ai/dd and our code at github.com/ApolloResearch/deception-detection.
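To make the probing recipe above concrete (fit a linear probe on model activations, check AUROC, then pick a decision threshold at a 1% false-positive rate on benign chat data), here is a minimal sketch using scikit-learn. It is not the authors' released implementation (see github.com/ApolloResearch/deception-detection for that); the activation arrays are random placeholders standing in for per-response activations extracted from a chosen layer of the monitored model, and the hidden size and dataset sizes are arbitrary assumptions.

```python
# Minimal sketch of an activation-based deception probe. NOT the authors' code:
# the arrays below are random placeholders for per-response activations taken
# from one layer of the monitored model, and all sizes are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 128  # stand-in for the model's hidden size

honest_acts    = rng.normal(0.0, 1.0, size=(500, d))   # activations on honest responses
deceptive_acts = rng.normal(0.5, 1.0, size=(500, d))   # activations on deceptive responses
control_acts   = rng.normal(0.0, 1.0, size=(2000, d))  # benign chat data unrelated to deception

# Fit a linear probe: label honest responses 0 and deceptive responses 1.
X = np.vstack([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(len(honest_acts)), np.ones(len(deceptive_acts))])
probe = LogisticRegression(max_iter=1000).fit(X, y)

# AUROC measures how well probe scores separate honest from deceptive responses.
# (For brevity this scores the training set; the paper evaluates held-out datasets.)
print("AUROC:", roc_auc_score(y, probe.decision_function(X)))

# Choose the threshold so only 1% of benign control responses are flagged,
# then report the fraction of deceptive responses caught at that threshold.
threshold = np.quantile(probe.decision_function(control_acts), 0.99)
caught = (probe.decision_function(deceptive_acts) > threshold).mean()
print(f"Deceptive responses caught at 1% FPR: {caught:.1%}")
```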
Related papers
- Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision.
We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time.
We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z)
- Building Better Deception Probes Using Targeted Instruction Pairs [1.610762469264735]
Linear probes are a promising approach for monitoring AI systems for deceptive behaviour.
In this paper, we identify the importance of the instruction pair used during training.
We show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets.
arXiv Detail & Related papers (2026-02-01T20:18:11Z)
- Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes [79.36545159724703]
We propose Latent Representation Probing (LRP) to train lightweight probes on hidden states or attention patterns.
LRP improves abstention accuracy by 7.6% over best baselines.
This establishes a principled framework for building deployment-ready AI systems.
arXiv Detail & Related papers (2025-11-25T00:24:42Z)
- Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless.
We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available.
We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z)
- Caught in the Act: a mechanistic approach to detecting deception [0.1013295809149289]
We show that linear probes on LLMs can detect deception in their responses with extremely high accuracy.
We observe that probes on smaller models (1.5B) achieve chance accuracy at detecting deception, while larger models (greater than 7B) reach 70-80%.
We find multitudes of linear directions that encode deception, ranging from 20 in Qwen 3B to nearly 100 in DeepSeek 7B and Qwen 14B models.
arXiv Detail & Related papers (2025-08-27T01:29:52Z)
- Benchmarking Fraud Detectors on Private Graph Data [70.4654745317714]
Currently, many types of fraud are managed in part by automated detection algorithms that operate over graphs.
We consider the scenario where a data holder wishes to outsource development of fraud detectors to third parties.
Third parties submit their fraud detectors to the data holder, who evaluates these algorithms on a private dataset and then publicly communicates the results.
We propose a realistic privacy attack on this system that allows an adversary to de-anonymize individuals' data based only on the evaluation results.
arXiv Detail & Related papers (2025-07-30T03:20:15Z)
- Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems.
Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios.
We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z)
- Language Models Can Predict Their Own Behavior [29.566208688211876]
Language models (LMs) can exhibit specific 'behaviors', such as a failure to follow alignment training, that we hope to detect and react to during deployment.
We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence.
An early warning system built on the probes reduces jailbreaking by 91%.
arXiv Detail & Related papers (2025-02-18T23:13:16Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- Eliciting Latent Knowledge from Quirky Language Models [1.8035046415192353]
Eliciting Latent Knowledge aims to find patterns in a capable neural network's activations that robustly track the true state of the world.
We introduce 12 datasets and a suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions.
We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs.
arXiv Detail & Related papers (2023-12-02T05:47:22Z)
- Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? [93.10694819127608]
We propose a unified evaluation pipeline for forecasting methods with real-world perception inputs.
Our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data.
arXiv Detail & Related papers (2023-06-15T17:03:14Z)
- Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
- Out-of-Distribution Detection with Hilbert-Schmidt Independence Optimization [114.43504951058796]
Outlier detection tasks have been playing a critical role in AI safety.
Deep neural network classifiers usually tend to incorrectly classify out-of-distribution (OOD) inputs into in-distribution classes with high confidence.
We propose an alternative probabilistic paradigm that is both practically useful and theoretically viable for the OOD detection tasks.
arXiv Detail & Related papers (2022-09-26T15:59:55Z)
- TRUST-LAPSE: An Explainable and Actionable Mistrust Scoring Framework for Model Monitoring [4.262769931159288]
We propose TRUST-LAPSE, a "mistrust" scoring framework for continuous model monitoring.
We assess the trustworthiness of each input sample's model prediction using a sequence of latent-space embeddings.
Our latent-space mistrust scores achieve state-of-the-art results with AUROCs of 84.1 (vision), 73.9 (audio), and 77.1 (clinical EEGs).
arXiv Detail & Related papers (2022-07-22T18:32:38Z)
- DAD: Data-free Adversarial Defense at Test Time [21.741026088202126]
Deep models are highly susceptible to adversarial attacks.
Privacy has become an important concern, restricting access to only trained models but not the training data.
We propose a completely novel problem of 'test-time adversarial defense in absence of training data and even their statistics'.
arXiv Detail & Related papers (2022-04-04T15:16:13Z)
- A Two-Block RNN-based Trajectory Prediction from Incomplete Trajectory [14.725386295605666]
We introduce a two-block RNN model that approximates the inference steps of the Bayesian filtering framework.
We show that the proposed model improves the prediction accuracy compared to the three baseline imputation methods.
We also show that our proposed method can achieve better prediction compared to the baselines when there is no miss-detection.
arXiv Detail & Related papers (2022-03-14T13:39:44Z)
- Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially.
We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate.
arXiv Detail & Related papers (2021-10-12T17:21:41Z)
- Learn what you can't learn: Regularized Ensembles for Transductive Out-of-distribution Detection [76.39067237772286]
We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios.
This paper studies how such "hard" OOD scenarios can benefit from adjusting the detection method after observing a batch of the test data.
We propose a novel method that uses an artificial labeling scheme for the test data and regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch.
arXiv Detail & Related papers (2020-12-10T16:55:13Z)
- Sequential Anomaly Detection using Inverse Reinforcement Learning [23.554584457413483]
We propose an end-to-end framework for sequential anomaly detection using inverse reinforcement learning (IRL).
We use a neural network to represent a reward function. Using a learned reward function, we evaluate whether a new observation from the target agent follows a normal pattern.
The empirical study on publicly available real-world data shows that our proposed method is effective in identifying anomalies.
arXiv Detail & Related papers (2020-04-22T05:17:36Z)
- Probabilistic Regression for Visual Tracking [193.05958682821444]
We propose a probabilistic regression formulation and apply it to tracking.
Our network predicts the conditional probability density of the target state given an input image.
Our tracker sets a new state-of-the-art on six datasets, achieving 59.8% AUC on LaSOT and 75.8% Success on TrackingNet.
arXiv Detail & Related papers (2020-03-27T17:58:37Z)