Caught in the Act: a mechanistic approach to detecting deception
- URL: http://arxiv.org/abs/2508.19505v2
- Date: Tue, 16 Sep 2025 20:22:56 GMT
- Title: Caught in the Act: a mechanistic approach to detecting deception
- Authors: Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval
- Abstract summary: We show that linear probes on LLMs can detect deception in their responses with extremely high accuracy. We observe that probes on smaller models (1.5B) achieve chance accuracy at detecting deception, while larger models (greater than 7B) reach 70-80%. We find multitudes of linear directions that encode deception, ranging from 20 in Qwen 3B to nearly 100 in DeepSeek 7B and Qwen 14B models.
- Score: 0.1013295809149289
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sophisticated instrumentation for AI systems might include indicators that signal misalignment from human values, not unlike a "check engine" light in cars. One such indicator of misalignment is deceptiveness in generated responses. Future AI instrumentation may be able to detect when an LLM generates deceptive responses while reasoning about seemingly plausible but incorrect answers to factual questions. In this work, we demonstrate that linear probes on LLMs' internal activations can detect deception in their responses with extremely high accuracy. Our probes reach a maximum accuracy of over 90% in distinguishing between deceptive and non-deceptive arguments generated by Llama and Qwen models ranging from 1.5B to 14B parameters, including their DeepSeek-R1 finetuned variants. We observe that probes on smaller models (1.5B) achieve chance accuracy at detecting deception, while larger models (greater than 7B) reach 70-80%, with their reasoning counterparts exceeding 90%. The layer-wise probe accuracy follows a three-stage pattern: near-random (50%) accuracy in early layers, a peak in middle layers, and a slight decline in later layers. Furthermore, using an iterative null-space projection approach, we find a multitude of linear directions that encode deception, ranging from 20 in Qwen 3B to nearly 100 in the DeepSeek 7B and Qwen 14B models.
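To make the abstract's setup concrete, below is a minimal sketch of a layer-wise linear probe and an iterative null-space projection loop, assuming activations have already been extracted for each layer. The array shapes, function names, and the accuracy threshold are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' code): layer-wise linear probes for
# deception detection plus iterative null-space projection. Assumes
# activations have already been extracted into NumPy arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(X, y, seed=0):
    """Fit a linear probe on activations X (n_examples, d_model) and
    return held-out accuracy for labels y (1 = deceptive, 0 = honest)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te), clf


def layerwise_sweep(acts_by_layer, y):
    """acts_by_layer: one (n_examples, d_model) array per layer.
    The abstract reports near-chance accuracy in early layers, a peak in
    middle layers, and a slight decline in later layers."""
    return [probe_accuracy(X, y)[0] for X in acts_by_layer]


def count_deception_directions(X, y, acc_floor=0.6, max_iters=200):
    """Iterative null-space projection: fit a probe, record its direction,
    project that direction out of the activations, and repeat until the
    probe falls back to (near-)chance accuracy. The number of recorded
    directions is a rough count of linear directions encoding deception."""
    X_proj = X.astype(np.float64).copy()
    directions = []
    for _ in range(max_iters):
        acc, clf = probe_accuracy(X_proj, y)
        if acc < acc_floor:
            break
        w = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
        directions.append(w)
        X_proj = X_proj - np.outer(X_proj @ w, w)  # remove component along w
    return directions
```

The accuracy floor used as a stopping rule here is a free parameter; the direction counts quoted in the abstract (roughly 20 to 100 depending on the model) will depend on how chance-level performance is defined.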
Related papers
- DRIFT: Detecting Representational Inconsistencies for Factual Truthfulness [5.785021425715989]
LLMs often produce fluent but incorrect answers, yet detecting such hallucinations typically requires multiple sampling passes or post-hoc verification. We propose a lightweight probe to read these signals directly from hidden states. We develop an LLM router that answers confident queries immediately while delegating uncertain ones to stronger models.
arXiv Detail & Related papers (2026-01-20T18:16:10Z) - Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes [79.36545159724703]
We propose Latent Representation Probing (LRP) to train lightweight probes on hidden states or attention patterns. LRP improves abstention accuracy by 7.6% over best baselines. This establishes a principled framework for building deployment-ready AI systems.
arXiv Detail & Related papers (2025-11-25T00:24:42Z) - Probing for Arithmetic Errors in Language Models [86.8227317662622]
Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy.
arXiv Detail & Related papers (2025-07-16T16:27:50Z) - A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection [9.335304254034401]
We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself. Our method achieves state-of-the-art detection performance with negligible computational overhead and no compromise to clean accuracy.
arXiv Detail & Related papers (2025-05-19T00:48:53Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - Detecting Strategic Deception Using Linear Probes [0.0]
We evaluate whether linear probes can robustly detect deception by monitoring model activations. We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999. Overall, we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. (A minimal AUROC evaluation sketch appears after this list.)
arXiv Detail & Related papers (2025-02-05T17:49:40Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Eliciting Latent Knowledge from Quirky Language Models [1.8035046415192353]
Eliciting Latent Knowledge aims to find patterns in a capable neural network's activations that robustly track the true state of the world.
We introduce 12 datasets and a suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions.
We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs.
arXiv Detail & Related papers (2023-12-02T05:47:22Z) - Space-based gravitational wave signal detection and extraction with deep neural network [13.176946557548042]
Space-based gravitational wave (GW) detectors will be able to observe signals from sources that are nearly impossible to detect with current ground-based instruments.
Here, we develop a high-accuracy GW signal detection and extraction method for all space-based GW sources.
arXiv Detail & Related papers (2022-07-15T11:48:15Z) - Neurosymbolic hybrid approach to driver collision warning [64.02492460600905]
There are two main algorithmic approaches to autonomous driving systems.
Deep learning alone has achieved state-of-the-art results in many areas.
But it can be very difficult to debug a deep learning model when it does not work.
arXiv Detail & Related papers (2022-03-28T20:29:50Z) - Neural Network Virtual Sensors for Fuel Injection Quantities with Provable Performance Specifications [71.1911136637719]
We show how provable guarantees can be naturally applied to other real-world settings.
We show how specific intervals of fuel injection quantities can be targeted to maximize robustness for certain ranges.
arXiv Detail & Related papers (2020-06-30T23:33:17Z) - Leveraging Uncertainties for Deep Multi-modal Object Detection in Autonomous Driving [12.310862288230075]
This work presents a probabilistic deep neural network that combines LiDAR point clouds and RGB camera images for robust, accurate 3D object detection.
We explicitly model uncertainties in the classification and regression tasks, and leverage uncertainties to train the fusion network via a sampling mechanism.
arXiv Detail & Related papers (2020-02-01T14:24:51Z)
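For the "Detecting Strategic Deception Using Linear Probes" entry above, the AUROC figures it quotes come from scoring responses with a fitted probe and ranking honest versus deceptive examples. A minimal sketch of that evaluation follows, using synthetic stand-in activations rather than the paper's data or models.

```python
# Minimal sketch of AUROC-style probe evaluation (synthetic stand-in data,
# not the paper's dataset or models).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
activations = rng.normal(size=(400, 512))   # stand-in for hidden states
labels = rng.integers(0, 2, size=400)       # 1 = deceptive, 0 = honest

# Fit the probe on one split, then rank the held-out split by the probe's
# deception probability and compute AUROC.
probe = LogisticRegression(max_iter=1000).fit(activations[:300], labels[:300])
scores = probe.predict_proba(activations[300:])[:, 1]
print("AUROC:", roc_auc_score(labels[300:], scores))
```

On real paired honest/deceptive activations, AUROC near 1.0 corresponds to the 0.96-0.999 range reported in that entry; on this random stand-in data it will sit near 0.5.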
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.