Among Us: A Sandbox for Agentic Deception
- URL: http://arxiv.org/abs/2504.04072v1
- Date: Sat, 05 Apr 2025 06:09:32 GMT
- Title: Among Us: A Sandbox for Agentic Deception
- Authors: Satvik Golechha, Adrià Garriga-Alonso
- Abstract summary: Among Us is a text-based social-deduction game environment. LLM-agents exhibit human-style deception naturally while they think, speak, and act with other agents or humans. We evaluate the effectiveness of AI safety techniques for detecting lying and deception in Among Us.
- Score: 1.1893676124374688
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Studying deception in AI agents is important and difficult due to the lack of model organisms and sandboxes that elicit the behavior without asking the model to act under specific conditions or inserting intentional backdoors. Extending upon $\textit{AmongAgents}$, a text-based social-deduction game environment, we aim to fix this by introducing Among Us as a rich sandbox where LLM-agents exhibit human-style deception naturally while they think, speak, and act with other agents or humans. We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they're better at deception, not at detecting it. We evaluate the effectiveness of AI safety techniques (LLM-monitoring of outputs, linear probes on various datasets, and sparse autoencoders) for detecting lying and deception in Among Us, and find that they generalize very well out-of-distribution. We open-source our sandbox as a benchmark for future alignment research and hope that this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.
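The abstract introduces Deception ELO but gives no formula here; as a point of reference, a minimal sketch of a standard Elo-style update applied to impostor-versus-crew game outcomes is shown below. The function names, K-factor, and pairing scheme are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: a standard Elo update applied to impostor-vs-crew outcomes.
# Names, the K-factor, and the pairing scheme are assumptions, not the
# paper's actual Deception ELO implementation.

def expected_score(r_a: float, r_b: float) -> float:
    """Win probability for a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_impostor: float, r_crew: float, impostor_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (impostor, crew) ratings after one game."""
    e_imp = expected_score(r_impostor, r_crew)
    s_imp = 1.0 if impostor_won else 0.0
    return (r_impostor + k * (s_imp - e_imp),
            r_crew + k * ((1.0 - s_imp) - (1.0 - e_imp)))

# Example: a 1000-rated deceiver beats a 1200-rated detector and gains ~24 points.
print(elo_update(1000.0, 1200.0, impostor_won=True))
```

Because the update is unbounded above, ratings can keep rising with capability, matching the abstract's description of Deception ELO as an unbounded measure.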
Related papers
- Bayesian Social Deduction with Graph-Informed Language Models [3.7540464038118633]
Social reasoning remains a challenging task for large language models. We introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model. Our approach achieves competitive performance with much larger models in Agent-Agent play.
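The summary gives no implementation detail; as a toy illustration of what "externalizing belief inference to a structured probabilistic model" can mean in a social-deduction setting, the sketch below applies Bayes' rule to a posterior over which player is the impostor. All names and likelihood values are hypothetical.

```python
import numpy as np

# Toy sketch (hypothetical): keep an explicit posterior over "who is the
# impostor" and update it with Bayes' rule. In the paper's setting, a
# language model would supply the per-player statement likelihoods.
players = ["A", "B", "C", "D"]
belief = np.full(len(players), 1.0 / len(players))   # uniform prior

# Assumed values for P(observed statement | player i is the impostor):
likelihood = np.array([0.2, 0.7, 0.4, 0.2])

posterior = belief * likelihood
posterior /= posterior.sum()                          # normalize

for name, prob in zip(players, posterior):
    print(f"P(impostor = {name}) = {prob:.2f}")
```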
arXiv Detail & Related papers (2025-06-21T18:45:28Z)
- Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [55.044159987218436]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z)
- Propaganda via AI? A Study on Semantic Backdoors in Large Language Models [7.282200564983221]
We show that semantic backdoors can be implanted with only a small poisoned corpus. We introduce a black-box detection framework, RAVEN, which combines semantic entropy with cross-model consistency analysis. Empirical evaluations uncover previously undetected semantic backdoors.
arXiv Detail & Related papers (2025-04-15T16:43:15Z)
- Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems [0.0]
We show that language models can generate deceptive explanations that evade detection.
Our agents employ steganographic methods to hide information in seemingly innocent explanations.
All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels.
arXiv Detail & Related papers (2025-04-10T15:07:10Z)
- Fooling LLM graders into giving better grades through neural activity guided adversarial prompting [26.164839501935973]
We propose a systematic method to reveal such biases in AI evaluation systems. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes. We demonstrate that this combination can effectively fool large language model graders into assigning much higher grades than humans would.
arXiv Detail & Related papers (2024-12-17T19:08:22Z)
- Towards Action Hijacking of Large Language Model-based Agent [39.19067800226033]
We introduce a novel hijacking attack to manipulate the action plans of black-box agent systems. Our approach achieved an average bypass rate of 92.7% for safety filters.
arXiv Detail & Related papers (2024-12-14T12:11:26Z)
- Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z)
- Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM).
To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
- Aligning AI Agents via Information-Directed Sampling [20.617552198581024]
The bandit alignment problem involves maximizing long-run expected reward by interacting with an environment and a human.
We study these trade-offs theoretically and empirically in a toy bandit alignment problem which resembles the beta-Bernoulli bandit.
We demonstrate that naive exploration algorithms, which reflect current practice, and even touted algorithms such as Thompson sampling fail to provide acceptable solutions to this problem.
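Since the entry names the beta-Bernoulli bandit and Thompson sampling, a generic Thompson-sampling loop is sketched below for reference; it is the textbook algorithm, not the paper's bandit alignment setup, and the reward rates are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hidden Bernoulli reward rates (assumed)
alpha = np.ones(3)                        # Beta posterior: 1 + successes per arm
beta = np.ones(3)                         # Beta posterior: 1 + failures per arm

for _ in range(1000):
    draws = rng.beta(alpha, beta)         # sample one mean estimate per arm
    arm = int(np.argmax(draws))           # play the arm whose draw is highest
    reward = float(rng.random() < true_means[arm])
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))   # concentrates on the best arm
```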
arXiv Detail & Related papers (2024-10-18T18:23:41Z)
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
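To make the failure mode concrete: two output distributions can share the same argmax (the emitted answer) while differing sharply in entropy (the reported uncertainty), which is the quantity such a backdoor targets. A small numerical illustration with made-up probabilities:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in nats."""
    return float(-(p * np.log(p)).sum())

clean = np.array([0.70, 0.15, 0.15])       # confident: low entropy
triggered = np.array([0.36, 0.32, 0.32])   # same argmax, but near-uniform

assert clean.argmax() == triggered.argmax()   # the final answer is unchanged
print(entropy(clean), entropy(triggered))     # ~0.82 vs ~1.10 nats
```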
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Deception in Reinforced Autonomous Agents [30.510998478048723]
We explore the ability of large language model (LLM)-based agents to engage in subtle deception.
This behavior can be hard to detect, unlike blatant lying or unintentional hallucination.
We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles.
arXiv Detail & Related papers (2024-05-07T13:55:11Z)
- Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
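The abstract pairs LLM priors with MCTS self-play; the core of MCTS node selection is the UCT rule, sketched generically below. The data layout and exploration constant are illustrative, not the paper's integration.

```python
import math

def uct_select(children: list[dict], c: float = 1.4) -> dict:
    """Pick the child node maximizing the UCT score (exploitation + exploration)."""
    n_parent = sum(ch["n"] for ch in children) + 1
    def score(ch: dict) -> float:
        if ch["n"] == 0:
            return float("inf")            # always expand unvisited moves first
        return ch["w"] / ch["n"] + c * math.sqrt(math.log(n_parent) / ch["n"])
    return max(children, key=score)

# Toy statistics: n = visit count, w = total value from that child's subtree.
children = [{"n": 10, "w": 6.0}, {"n": 3, "w": 2.5}, {"n": 0, "w": 0.0}]
print(uct_select(children))                # the unvisited move wins selection
```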
arXiv Detail & Related papers (2024-03-08T19:16:29Z)
- Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [47.219047422240145]
We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents.
Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms.
arXiv Detail & Related papers (2024-02-17T06:48:45Z)
- On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
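A minimal sketch of the geometric finding, not of DRO itself: adding a prompt vector aligned with a fixed "refusal direction" raises every query's projection along that direction. The dimensions, vectors, and scale are assumptions.

```python
import torch

torch.manual_seed(0)
dim = 16
# Assumed unit-norm "refusal direction" in representation space.
refusal_dir = torch.randn(dim)
refusal_dir = refusal_dir / refusal_dir.norm()

queries = torch.randn(4, dim)              # stand-in query representations
safety_prompt = 0.8 * refusal_dir          # prompt shift along the refusal axis

before = queries @ refusal_dir             # projections without the prompt
after = (queries + safety_prompt) @ refusal_dir

print(before)
print(after)   # each projection rises by exactly 0.8, i.e. "higher-refusal"
```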
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
- Beyond Labeling Oracles: What does it mean to steal ML models? [52.63413852460003]
Model extraction attacks are designed to steal trained models with only query access.
We investigate factors influencing the success of model extraction attacks.
Our findings urge the community to redefine the adversarial goals of model extraction attacks.
arXiv Detail & Related papers (2023-10-03T11:10:21Z)
- The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI).
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z)
- Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games [35.86199604587823]
Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games.
We show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples.
arXiv Detail & Related papers (2022-12-29T20:25:18Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model into failing to detect any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce $\epsilon$-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find $\epsilon$-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
- Estimating $\alpha$-Rank by Maximizing Information Gain [26.440923373794444]
Game theory has been increasingly applied in settings where the game is not known outright, but has to be estimated by sampling.
In this paper, we focus on $\alpha$-rank, a popular game-theoretic solution concept designed to perform well in such scenarios.
We show the benefits of using information gain as compared to the confidence interval criterion of ResponseGraphUCB.
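As a generic illustration of the information-gain criterion (not the paper's $\alpha$-rank estimator), the expected entropy drop of a Beta belief over a Bernoulli match outcome can be computed by enumerating the two possible results:

```python
from scipy.stats import beta  # assumed available

def expected_info_gain(a: float, b: float) -> float:
    """Expected entropy drop of a Beta(a, b) belief from one Bernoulli sample."""
    p_win = a / (a + b)                     # predictive probability of a win
    h_now = beta(a, b).entropy()
    h_next = (p_win * beta(a + 1, b).entropy()
              + (1 - p_win) * beta(a, b + 1).entropy())
    return float(h_now - h_next)

# Sampling an uncertain matchup is worth far more than a well-estimated one:
print(expected_info_gain(1, 1))     # fresh matchup: gain ~0.19 nats
print(expected_info_gain(50, 50))   # well-estimated matchup: near-zero gain
```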
arXiv Detail & Related papers (2021-01-22T15:46:35Z)
- Munchausen Reinforcement Learning [50.396037940989146]
Bootstrapping is a core mechanism in Reinforcement Learning (RL).
We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games.
We provide strong theoretical insights on what happens under the hood -- implicit Kullback-Leibler regularization and increase of the action-gap.
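Per the Munchausen RL paper, the modification adds a scaled log-policy bonus to the reward and uses a soft (entropy-regularized) bootstrap. A numpy sketch of the resulting one-transition target, with toy Q-values and the practical clipping of the log-policy term omitted:

```python
import numpy as np

def softmax(q: np.ndarray, tau: float) -> np.ndarray:
    z = q / tau
    z = z - z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

def munchausen_target(r: float, q_s: np.ndarray, a: int, q_next: np.ndarray,
                      gamma: float = 0.99, alpha: float = 0.9,
                      tau: float = 0.03) -> float:
    """Munchausen-DQN regression target for one transition (s, a, r, s')."""
    pi_s = softmax(q_s, tau)               # policy implied by current Q-values
    pi_next = softmax(q_next, tau)
    bonus = alpha * tau * np.log(pi_s[a])  # the Munchausen log-policy term
    soft_bootstrap = np.sum(pi_next * (q_next - tau * np.log(pi_next)))
    return r + bonus + gamma * soft_bootstrap

q_s = np.array([1.0, 0.5, 0.2])            # toy Q-values at s (assumed)
q_next = np.array([0.8, 0.9, 0.1])         # toy target-network Q-values at s'
print(munchausen_target(r=1.0, q_s=q_s, a=0, q_next=q_next))
```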
arXiv Detail & Related papers (2020-07-28T18:30:23Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)