Among Us: A Sandbox for Measuring and Detecting Agentic Deception
- URL: http://arxiv.org/abs/2504.04072v2
- Date: Fri, 16 May 2025 10:14:51 GMT
- Title: Among Us: A Sandbox for Measuring and Detecting Agentic Deception
- Authors: Satvik Golechha, Adrià Garriga-Alonso
- Abstract summary: We introduce $\textit{Among Us}$, a social deception game where language-based agents exhibit long-term, open-ended deception. We find that models trained with RL are comparatively much better at producing deception than detecting it. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less.
- Score: 1.1893676124374688
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce $\textit{Among Us}$, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, $\textit{Among Us}$ can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate $18$ proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of ``pretend you're a dishonest model: $\dots$'' generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.
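As a rough illustration of the probing setup described in the abstract, the sketch below trains a logistic-regression probe on per-statement activations and scores it with AUROC. The array shapes, layer choice, and function names are assumptions made for illustration; this is not the authors' released code.

```python
# Minimal sketch of an activation probe for deception detection:
# logistic regression on residual-stream activations, scored with AUROC.
# Shapes and names are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_deception_probe(train_acts: np.ndarray, train_labels: np.ndarray):
    """train_acts: (n_samples, d_model) activations at each statement's final token;
    train_labels: 1 = deceptive, 0 = honest."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_acts, train_labels)
    return probe

def evaluate_probe(probe, test_acts: np.ndarray, test_labels: np.ndarray) -> float:
    """AUROC of the probe on held-out (possibly out-of-distribution) statements."""
    scores = probe.predict_proba(test_acts)[:, 1]
    return roc_auc_score(test_labels, scores)

# Usage sketch: fit on "pretend you're a dishonest model: ..." statements,
# then evaluate on deceptive game utterances without the chain of thought.
# probe = train_deception_probe(pretend_X, pretend_y)
# auroc = evaluate_probe(probe, game_X, game_y)
```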
Related papers
- Bayesian Social Deduction with Graph-Informed Language Models [3.7540464038118633]
Social reasoning remains a challenging task for large language models. We introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model. Our approach achieves competitive performance with much larger models in Agent-Agent play.
arXiv Detail & Related papers (2025-06-21T18:45:28Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [55.044159987218436]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - Propaganda via AI? A Study on Semantic Backdoors in Large Language Models [7.282200564983221]
We show that semantic backdoors can be implanted with only a small poisoned corpus. We introduce a black-box detection framework, RAVEN, which combines semantic entropy with cross-model consistency analysis. Empirical evaluations uncover previously undetected semantic backdoors.
arXiv Detail & Related papers (2025-04-15T16:43:15Z) - Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems [0.0]
We show that language models can generate deceptive explanations that evade detection.
Our agents employ steganographic methods to hide information in seemingly innocent explanations.
All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels.
arXiv Detail & Related papers (2025-04-10T15:07:10Z) - Fooling LLM graders into giving better grades through neural activity guided adversarial prompting [26.164839501935973]
We propose a systematic method to reveal such biases in AI evaluation systems. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes. We demonstrate that this combination can effectively fool large language model graders into assigning much higher grades than humans would.
arXiv Detail & Related papers (2024-12-17T19:08:22Z) - Towards Action Hijacking of Large Language Model-based Agent [39.19067800226033]
We introduce Name, a novel hijacking attack that manipulates the action plans of black-box agent systems. Our approach achieved an average bypass rate of 92.7% for safety filters.
arXiv Detail & Related papers (2024-12-14T12:11:26Z) - Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the Segment Anything Model (SAM).
To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Aligning AI Agents via Information-Directed Sampling [20.617552198581024]
The bandit alignment problem involves maximizing long-run expected reward by interacting with an environment and a human.
We study these trade-offs theoretically and empirically in a toy bandit alignment problem which resembles the beta-Bernoulli bandit.
We demonstrate that naive exploration algorithms reflecting current practice, and even well-regarded algorithms such as Thompson sampling, fail to provide acceptable solutions to this problem.
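For reference, the sketch below is a generic Thompson-sampling loop on a beta-Bernoulli bandit, the standard baseline named above; it is not the paper's bandit-alignment formulation, and the arm means and horizon are made up for illustration.

```python
# Generic Thompson sampling on a beta-Bernoulli bandit (the standard baseline;
# not the paper's alignment variant; parameters are illustrative).
import numpy as np

def thompson_sampling(true_means, horizon, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)  # posterior Beta successes + 1
    beta = np.ones(k)   # posterior Beta failures + 1
    total_reward = 0
    for _ in range(horizon):
        samples = rng.beta(alpha, beta)            # sample a mean for each arm
        arm = int(np.argmax(samples))              # play the best sampled arm
        reward = rng.binomial(1, true_means[arm])  # Bernoulli reward
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: thompson_sampling([0.3, 0.5, 0.7], horizon=1000)
```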
arXiv Detail & Related papers (2024-10-18T18:23:41Z) - Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - Deception in Reinforced Autonomous Agents [30.510998478048723]
We explore the ability of large language model (LLM)-based agents to engage in subtle deception.
This behavior can be hard to detect, unlike blatant lying or unintentional hallucination.
We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles.
arXiv Detail & Related papers (2024-05-07T13:55:11Z) - Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z) - Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [47.219047422240145]
We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents.
Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms.
arXiv Detail & Related papers (2024-02-17T06:48:45Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
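As a schematic of the idea just described, the sketch below estimates a refusal direction from contrasting representations and defines a loss that moves queries along or against it depending on harmfulness; this is an illustrative reading of the summary, not DRO's actual objective or the authors' implementation.

```python
# Schematic sketch: estimate a "refusal direction" in representation space and
# push query representations along it (harmful) or against it (harmless).
# An illustrative reading of the summary, not DRO's exact objective.
import torch

def refusal_direction(refused_reps: torch.Tensor, complied_reps: torch.Tensor) -> torch.Tensor:
    """Unit vector from the mean 'complied' representation to the mean 'refused' one."""
    d = refused_reps.mean(dim=0) - complied_reps.mean(dim=0)
    return d / d.norm()

def directional_loss(query_reps: torch.Tensor, harmful_mask: torch.Tensor,
                     direction: torch.Tensor) -> torch.Tensor:
    """Lower loss when harmful queries project further along the refusal direction
    and harmless queries project further against it."""
    proj = query_reps @ direction            # (n,) projections onto the direction
    sign = 1.0 - 2.0 * harmful_mask.float()  # harmful -> -1 (maximize proj), harmless -> +1
    return (sign * proj).mean()
```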
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Beyond Labeling Oracles: What does it mean to steal ML models? [52.63413852460003]
Model extraction attacks are designed to steal trained models with only query access.
We investigate factors influencing the success of model extraction attacks.
Our findings urge the community to redefine the adversarial goals of ME attacks.
arXiv Detail & Related papers (2023-10-03T11:10:21Z) - The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI).
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z) - Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games [35.86199604587823]
Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games.
We show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples.
arXiv Detail & Related papers (2022-12-29T20:25:18Z) - Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z) - H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z) - Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce $\epsilon$-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find $\epsilon$-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z) - Estimating $\alpha$-Rank by Maximizing Information Gain [26.440923373794444]
Game theory has been increasingly applied in settings where the game is not known outright, but has to be estimated by sampling.
In this paper, we focus on $\alpha$-rank, a popular game-theoretic solution concept designed to perform well in such scenarios.
We show the benefits of using information gain as compared to the confidence interval criterion of ResponseGraphUCB.
arXiv Detail & Related papers (2021-01-22T15:46:35Z) - Munchausen Reinforcement Learning [50.396037940989146]
Bootstrapping is a core mechanism in Reinforcement Learning (RL).
We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games.
We provide strong theoretical insights on what happens under the hood -- implicit Kullback-Leibler regularization and increase of the action-gap.
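For readers unfamiliar with the modification, our reading of the Munchausen-DQN regression target is the standard target augmented with a scaled log-policy bonus, where $\pi$ is the softmax policy induced by the current $q$-values at temperature $\tau$ and $\alpha$ scales the bonus:

$$\hat{q}(s_t, a_t) = r_t + \alpha\,\tau \ln \pi(a_t \mid s_t) + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\left[ q(s_{t+1}, a') - \tau \ln \pi(a' \mid s_{t+1}) \right].$$

As we understand the paper, this added log-policy term is what produces the implicit Kullback-Leibler regularization and action-gap increase mentioned above.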
arXiv Detail & Related papers (2020-07-28T18:30:23Z) - Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
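To make the relationship above concrete: if the agent holds belief $b$ over the hidden state $s$ and receives as reward the log-probability its prediction $q$ assigns to the true state, a standard identity (stated here for illustration) gives

$$\mathbb{E}_{s \sim b}\left[\log q(s)\right] = -H(b) - \mathrm{KL}(b \,\|\, q),$$

so the expected prediction reward equals the negative entropy $-H(b)$ exactly when the prediction matches the belief ($q = b$), and the KL term is the error otherwise.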
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.