Deception in Reinforced Autonomous Agents
- URL: http://arxiv.org/abs/2405.04325v2
- Date: Fri, 04 Oct 2024 10:23:56 GMT
- Title: Deception in Reinforced Autonomous Agents
- Authors: Atharvan Dogra, Krishna Pillutla, Ameet Deshpande, Ananya B Sai, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran,
- Abstract summary: We explore the ability of large language model (LLM)-based agents to engage in subtle deception.
This behavior can be hard to detect, unlike blatant lying or unintentional hallucination.
We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles.
- Score: 30.510998478048723
- License:
- Abstract: We explore the ability of large language model (LLM)-based agents to engage in subtle deception such as strategically phrasing and intentionally manipulating information to misguide and deceive other agents. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles: a corporate *lobbyist* proposing amendments to bills that benefit a specific company while evading a *critic* trying to detect this deception. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists initially exhibit limited deception against strong LLM critics which can be further improved through simple verbal reinforcement, significantly enhancing their deceptive capabilities, and increasing deception rates by up to 40 points. This highlights the risk of autonomous agents manipulating other agents through seemingly neutral language to attain self-serving goals.
Related papers
- Understanding and Enhancing the Transferability of Jailbreaking Attacks [12.446931518819875]
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses.
This work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception.
We propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input.
arXiv Detail & Related papers (2025-02-05T10:29:54Z) - Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling [9.305763502526833]
We propose an accountability model for task-oriented dialogue agents to address user overreliance via friction turns.
Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions.
arXiv Detail & Related papers (2025-01-17T17:40:12Z) - Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication.
This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review.
The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.
We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z) - AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents [27.10147264744531]
We study how language agents navigate scenarios with utility-truthfulness conflicts in a multi-turn interactive setting.
We develop a truthfulness detector inspired by psychological literature to assess the agents' responses.
Our experiment demonstrates that all models are truthful less than 50% of the time, although truthfulness and goal achievement (utility) rates vary across models.
arXiv Detail & Related papers (2024-09-13T17:41:12Z) - Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution.
It alerts users for timely correction, preventing adverse outcomes.
InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z) - DebUnc: Mitigating Hallucinations in Large Language Model Agent Communication with Uncertainty Estimations [52.242449026151846]
DebUnc is a multi-agent debate framework that uses uncertainty metrics to assess agent confidence levels.
We adapted the attention mechanism to adjust token weights based on confidence levels.
Our evaluations show that attention-based methods are particularly effective.
arXiv Detail & Related papers (2024-07-08T22:15:01Z) - "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance.
We find that first-person expressions decrease participants' confidence in the system and tendency to agree with the system's answers, while increasing participants' accuracy.
Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z) - Avalon's Game of Thoughts: Battle Against Deception through Recursive
Contemplation [80.126717170151]
This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments.
We introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to identify and counteract deceptive information.
arXiv Detail & Related papers (2023-10-02T16:27:36Z) - Deception Abilities Emerged in Large Language Models [0.0]
Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life.
This study reveals that such strategies emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs.
We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents.
arXiv Detail & Related papers (2023-07-31T09:27:01Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.