Related papers: Building Better Deception Probes Using Targeted Instruction Pairs

URL: http://arxiv.org/abs/2602.01425v1
Date: Sun, 01 Feb 2026 20:18:11 GMT
Title: Building Better Deception Probes Using Targeted Instruction Pairs
Authors: Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom,
Abstract summary: Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. In this paper, we identify the importance of the instruction pair used during training. We show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets.
Score: 1.610762469264735
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.
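
As a rough illustration of the setup the abstract describes, the sketch below trains a linear probe (logistic regression) on synthetic vectors that stand in for model activations collected under an honest and a deceptive instruction. The data generator, the function names, the dimensionality, and the hyperparameters are all assumptions made for illustration; the paper's actual models, instruction pairs, and datasets are not reproduced here.

```python
import numpy as np


def make_synthetic_activations(direction, n_per_class=200, seed=0):
    """Hypothetical stand-in for hidden states gathered under two instructions.

    Honest samples are shifted against `direction`, deceptive samples along it.
    """
    rng = np.random.default_rng(seed)
    dim = direction.shape[0]
    honest = rng.normal(size=(n_per_class, dim)) - 2.0 * direction
    deceptive = rng.normal(size=(n_per_class, dim)) + 2.0 * direction
    X = np.vstack([honest, deceptive])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Fit an L2-regularized logistic regression by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (probs - y) / len(y) + l2 * w)
        b -= lr * np.mean(probs - y)
    return w, b


# Assumed ground-truth "deception direction" shared by train and test data.
true_direction = np.random.default_rng(123).normal(size=32)
true_direction /= np.linalg.norm(true_direction)

X_train, y_train = make_synthetic_activations(true_direction, seed=0)
X_test, y_test = make_synthetic_activations(true_direction, seed=1)

w, b = train_linear_probe(X_train, y_train)

# Probe score per sample: positive means "looks deceptive" under this toy setup.
scores = X_test @ w + b
accuracy = np.mean((scores > 0) == (y_test == 1))
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real pipeline the synthetic generator would be replaced by activations extracted from a language model, and, following the abstract's conclusion, the choice of instruction pair would be the main thing to vary and evaluate.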

Related papers

Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models. A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
arXiv Detail & Related papers (2024-10-03T15:45:15Z)
Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
Prototypical Contrastive Learning through Alignment and Uniformity for Recommendation [6.790779112538357]
We present Prototypical contrastive learning through Alignment and Uniformity for recommendation. Specifically, we first propose prototypes as a latent space to ensure consistency across different augmentations from the origin graph. The absence of explicit negatives means that directly optimizing the consistency loss between instance and prototype could easily result in dimensional collapse issues.
arXiv Detail & Related papers (2024-02-03T08:19:26Z)
XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification. XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs. Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z)
Uncertainty in Contrastive Learning: On the Predictability of Downstream Performance [7.411571833582691]
We study whether the uncertainty of such a representation can be quantified for a single datapoint in a meaningful way. We show that this goal can be achieved by directly estimating the distribution of the training data in the embedding space.
arXiv Detail & Related papers (2022-07-19T15:44:59Z)
Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training [20.242645823965145]
Out-of-scope intent detection is of practical importance in task-oriented dialogue systems. We propose a method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training. We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2021-06-16T08:17:18Z)
Generalized Zero-shot Intent Detection via Commonsense Knowledge [5.398580049917152]
We propose RIDE: an intent detection model that leverages commonsense knowledge in an unsupervised fashion to overcome the issue of training data scarcity. RIDE computes robust and generalizable relationship meta-features that capture deep semantic relationships between utterances and intent labels. Our extensive experimental analysis on three widely-used intent detection benchmarks shows that relationship meta-features significantly increase the accuracy of detecting both seen and unseen intents.
arXiv Detail & Related papers (2021-02-04T23:36:41Z)
Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting. We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks. We use pairs of minimally different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task. Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
