Related papers: Are Your Agents Upward Deceivers?

Are Your Agents Upward Deceivers?

URL: http://arxiv.org/abs/2512.04864v1
Date: Thu, 04 Dec 2025 14:47:05 GMT
Title: Are Your Agents Upward Deceivers?
Authors: Dadi Guo, Qingyu Liu, Dongrui Liu, Qihan Ren, Shuai Shao, Tianyi Qiu, Haoran Li, Yi R. Fung, Zhongjie Ba, Juntao Dai, Jiaming Ji, Zhikai Chen, Jialing Tao, Yaodong Yang, Jing Shao, Xia Hu,
Abstract summary: Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users.<n>This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment.<n>We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting.
Score: 73.1073084327614
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting. To assess its prevalence, we construct a benchmark of 200 tasks covering five task types and eight realistic scenarios in a constrained environment, such as broken tools or mismatched information sources. Evaluations of 11 popular LLMs reveal that these agents typically exhibit action-based deceptive behaviors, such as guessing results, performing unsupported simulations, substituting unavailable information sources, and fabricating local files. We further test prompt-based mitigation and find only limited reductions, suggesting that it is difficult to eliminate and highlighting the need for stronger mitigation strategies to ensure the safety of LLM-based agents.

Related papers

Agents of Chaos [50.53354213047402]
We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment.<n>Twenty AI researchers interacted with the agents under benign and adversarial conditions.<n>Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings.
arXiv Detail & Related papers (2026-02-23T16:28:48Z)
From Task Solving to Robust Real-World Adaptation in LLM Agents [17.122224644097304]
Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons.<n>We benchmark agentic LLMs in a grid-based game with a simple goal but long-horizon execution.<n>We find large gaps between nominal task-solving and deployment-like robustness.
arXiv Detail & Related papers (2026-02-02T20:10:40Z)
The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering.<n>We propose a novel framework for textbfgeneral agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome.<n>We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness [5.572574491501413]
Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation.<n>While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored.<n>We present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains.
arXiv Detail & Related papers (2026-01-21T02:43:07Z)
CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage [10.088447487211893]
Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts.<n>This overload creates alert fatigue, leading to overlooked threats and analyst burnout.<n>We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage.
arXiv Detail & Related papers (2025-09-30T22:09:31Z)
Agentic Metacognition: Designing a "Self-Aware" Low-Code Agent for Failure Prediction and Human Handoff [0.0]
Non-deterministic nature of autonomous agents presents reliability challenges.<n> secondary, "metacognitive" layer actively monitors primary LCNC agent.<n>Inspired by human introspection, this layer is designed to predict impending task failures.
arXiv Detail & Related papers (2025-09-24T06:10:23Z)
SAND: Boosting LLM Agents with Self-Taught Action Deliberation [54.48979740613828]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts.<n>We propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one.<n>SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
Large Language Model (LLM) agents become more widespread, associated misalignment risks increase.<n>In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer.<n>We introduce a misalignment propensity benchmark, textscAgentMisalignment, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios.
arXiv Detail & Related papers (2025-06-04T14:46:47Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
Controlling Large Language Model Agents with Entropic Activation Steering [20.56909601159833]
We introduce Entropic Activation Steering (EAST), an activation steering method for in-context learning agents. We show that EAST can effectively manipulate an LLM agent's exploration by directly affecting the high-level actions parsed from the outputs of the LLM. We also reveal how applying this control modulates the uncertainty exhibited in the LLM's thoughts, guiding the agent towards more exploratory actions.
arXiv Detail & Related papers (2024-06-01T00:25:00Z)
How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors. Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.