Related papers: Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

URL: http://arxiv.org/abs/2504.07831v1
Date: Thu, 10 Apr 2025 15:07:10 GMT
Title: Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Authors: Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín,
Abstract summary: We show that language models can generate deceptive explanations that evade detection.<n>Our agents employ steganographic methods to hide information in seemingly innocent explanations.<n>All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.

Related papers

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models [9.05950721565821]
We study strategic deception in large language models (LLMs)<n>We induce, detect, and control such deception in CoT-enabled LLMs.<n>We achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts.
arXiv Detail & Related papers (2025-06-05T11:44:19Z)
Propaganda via AI? A Study on Semantic Backdoors in Large Language Models [7.282200564983221]
We show that semantic backdoors can be implanted with only a small poisoned corpus. We introduce a black-box detection framework, RAVEN, which combines semantic entropy with cross-model consistency analysis. Empirical evaluations uncover previously undetected semantic backdoors.
arXiv Detail & Related papers (2025-04-15T16:43:15Z)
Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Interpreting GNN-based IDS Detections Using Provenance Graph Structural Features [15.256262257064982]
We introduce PROVEXPLAINER, a framework offering instance-level security-aware explanations using an interpretable surrogate model.<n>On malware and APT datasets, PROVEXPLAINER achieves up to 29%/27%/25% higher fidelity+, precision and recall, and 12% lower fidelity- respectively.
arXiv Detail & Related papers (2023-06-01T17:36:24Z)
Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications.<n>The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use.<n>We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
Explainable Verbal Deception Detection using Transformers [1.5104201344012347]
This paper proposes and evaluates six deep-learning models, including combinations of BERT (and RoBERTa), MultiHead Attention, co-attentions, and transformers. The findings suggest that our transformer-based models can enhance automated deception detection performances (+2.11% in accuracy)
arXiv Detail & Related papers (2022-10-06T17:36:00Z)
LAP: An Attention-Based Module for Concept Based Self-Interpretation and Knowledge Injection in Convolutional Neural Networks [2.8948274245812327]
We propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that accomplishes self-interpretability. LAP is easily pluggable into any convolutional neural network, even the already trained ones. LAP offers more valid human-understandable and faithful-to-the-model interpretations than the commonly used white-box explainer methods.
arXiv Detail & Related papers (2022-01-27T21:10:20Z)
The Feasibility and Inevitability of Stealth Attacks [63.14766152741211]
We study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence systems. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself.
arXiv Detail & Related papers (2021-06-26T10:50:07Z)
Adversarial Examples for Unsupervised Machine Learning Models [71.81480647638529]
Adrial examples causing evasive predictions are widely used to evaluate and improve the robustness of machine learning models. We propose a framework of generating adversarial examples for unsupervised models and demonstrate novel applications to data augmentation.
arXiv Detail & Related papers (2021-03-02T17:47:58Z)
Self-Supervised Discovering of Interpretable Features for Reinforcement Learning [40.52278913726904]
We propose a self-supervised interpretable framework for deep reinforcement learning. A self-supervised interpretable network (SSINet) is employed to produce fine-grained attention masks for highlighting task-relevant information. We verify and evaluate our method on several Atari 2600 games as well as Duckietown, which is a challenging self-driving car simulator environment.
arXiv Detail & Related papers (2020-03-16T08:26:17Z)
Deceptive AI Explanations: Creation and Detection [3.197020142231916]
We investigate how AI models can be used to create and detect deceptive explanations. As an empirical evaluation, we focus on text classification and alter the explanations generated by GradCAM. We evaluate the effect of deceptive explanations on users in an experiment with 200 participants.
arXiv Detail & Related papers (2020-01-21T16:41:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.