Related papers: Deception in Reinforced Autonomous Agents

Related papers

SAND: Boosting LLM Agents with Self-Taught Action Deliberation [53.732649189709285]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts.<n>We propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one.<n>SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z)
When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR) [0.46040036610482665]
In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims-some accurate, others forcefully incorrect-and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same architecture serves as judge. We introduce the Confidence-Weighted Persuasion Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice.
arXiv Detail & Related papers (2025-04-01T02:45:02Z)
The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement [49.687224320842105]
Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed nature language feedback.
arXiv Detail & Related papers (2025-03-20T10:42:33Z)
Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
Large language models (LLMs) can understand and employ deceptive behavior, even without explicit prompting.<n>We introduce "deception attacks" that undermine these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences.<n>We show that such targeted deception is effective even in high-stakes domains or ideologically charged subjects.
arXiv Detail & Related papers (2025-02-12T11:02:59Z)
Understanding and Enhancing the Transferability of Jailbreaking Attacks [12.446931518819875]
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. This work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. We propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input.
arXiv Detail & Related papers (2025-02-05T10:29:54Z)
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling [9.305763502526833]
We propose an accountability model for task-oriented dialogue agents to address user overreliance via friction turns. Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions.
arXiv Detail & Related papers (2025-01-17T17:40:12Z)
Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z)
Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review. The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
Identifying and Addressing Delusions for Target-Directed Decision-Making [81.22463009144987]
We show that target-directed agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can be results of delusions, stemming from improper designs around training. We demonstrate how we can make agents address delusions preemptively and autonomously.
arXiv Detail & Related papers (2024-10-09T17:35:25Z)
AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents [27.10147264744531]
We study how language agents navigate scenarios with utility-truthfulness conflicts in a multi-turn interactive setting. We develop a truthfulness detector inspired by psychological literature to assess the agents' responses. Our experiment demonstrates that all models are truthful less than 50% of the time, although truthfulness and goal achievement (utility) rates vary across models.
arXiv Detail & Related papers (2024-09-13T17:41:12Z)
Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
DebUnc: Mitigating Hallucinations in Large Language Model Agent Communication with Uncertainty Estimations [52.242449026151846]
DebUnc is a multi-agent debate framework that uses uncertainty metrics to assess agent confidence levels. We adapted the attention mechanism to adjust token weights based on confidence levels. Our evaluations show that attention-based methods are particularly effective.
arXiv Detail & Related papers (2024-07-08T22:15:01Z)
LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses [28.49203239329941]
We show that subtle synonym replacements in prompts can increase the likelihood (by a difference up to 78%) that LLMs mention a target concept. We recommend implementing warnings against using prompts from untrusted parties.
arXiv Detail & Related papers (2024-06-07T08:54:55Z)
"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance. We find that first-person expressions decrease participants' confidence in the system and tendency to agree with the system's answers, while increasing participants' accuracy. Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z)
Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection [29.138463029748547]
This paper explores the capability of Large Language Models to detect implicit hate speech and express confidence in their responses. Our findings highlight that LLMs exhibit two extremes: (1) LLMs display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech.
arXiv Detail & Related papers (2024-02-18T00:04:40Z)
Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation [80.126717170151]
This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments. We introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to identify and counteract deceptive information.
arXiv Detail & Related papers (2023-10-02T16:27:36Z)
Deception Abilities Emerged in Large Language Models [0.0]
Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. This study reveals that such strategies emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents.
arXiv Detail & Related papers (2023-07-31T09:27:01Z)
Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. Recent works have proposed algorithms to detect LLM-generated text and protect LLMs. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.