Related papers: Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

URL: http://arxiv.org/abs/2506.03332v1
Date: Tue, 03 Jun 2025 19:26:23 GMT
Title: Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows
Authors: Yifei Ming, Zixuan Ke, Xuan-Phi Nguyen, Jiayu Wang, Shafiq Joty,
Abstract summary: We present a systematic analysis of agentic robustness under deceptive or misleading feedback.<n>We reveal that even strongest agents are vulnerable to persuasive yet flawed critiques.<n>Our findings highlight fundamental vulnerabilities in feedback-based robustness and offer guidance for building more robust agentic systems.
Score: 41.97051158610974
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Agentic workflows -- where multiple large language model (LLM) instances interact to solve tasks -- are increasingly built on feedback mechanisms, where one model evaluates and critiques another. Despite the promise of feedback-driven improvement, the stability of agentic workflows rests on the reliability of the judge. However, judges may hallucinate information, exhibit bias, or act adversarially -- introducing critical vulnerabilities into the workflow. In this work, we present a systematic analysis of agentic workflows under deceptive or misleading feedback. We introduce a two-dimensional framework for analyzing judge behavior, along axes of intent (from constructive to malicious) and knowledge (from parametric-only to retrieval-augmented systems). Using this taxonomy, we construct a suite of judge behaviors and develop WAFER-QA, a new benchmark with critiques grounded in retrieved web evidence to evaluate robustness of agentic workflows against factually supported adversarial feedback. We reveal that even strongest agents are vulnerable to persuasive yet flawed critiques -- often switching correct answers after a single round of misleading feedback. Taking a step further, we study how model predictions evolve over multiple rounds of interaction, revealing distinct behavioral patterns between reasoning and non-reasoning models. Our findings highlight fundamental vulnerabilities in feedback-based workflows and offer guidance for building more robust agentic systems.

Related papers

When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs [8.575522204707958]
Large language models (LLMs) grow in capability and autonomy, evaluating their outputs-especially in open-ended and complex tasks-has become a critical bottleneck.<n>A new paradigm is emerging: using AI agents as the evaluators themselves.<n>In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings.
arXiv Detail & Related papers (2025-08-05T01:42:25Z)
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low.<n>Inference-time alignment relies on three components: sampling, evaluation, and feedback.<n>We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.<n>Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.<n>We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse.<n>We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space.<n>Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
Enhancing Trust in Autonomous Agents: An Architecture for Accountability and Explainability through Blockchain and Large Language Models [0.3495246564946556]
This work presents an accountability and explainability architecture implemented for ROS-based mobile robots.<n>The proposed solution consists of two main components. Firstly, a black box-like element to provide accountability, featuring anti-tampering properties achieved through blockchain technology.<n> Secondly, a component in charge of generating natural language explanations by harnessing the capabilities of Large Language Models (LLMs) over the data contained within the previously mentioned black box.
arXiv Detail & Related papers (2024-03-14T16:57:18Z)
Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations [62.48505112245388]
We take an in-depth look at the causal awareness of modern representations of agent interactions. We show that recent representations are already partially resilient to perturbations of non-causal agents. We propose a metric learning approach that regularizes latent representations with causal annotations.
arXiv Detail & Related papers (2023-12-07T18:57:03Z)
Robustness Testing for Multi-Agent Reinforcement Learning: State Perturbations on Critical Agents [2.5204420653245245]
Multi-Agent Reinforcement Learning (MARL) has been widely applied in many fields such as smart traffic and unmanned aerial vehicles. This work proposes a novel Robustness Testing framework for MARL that attacks states of Critical Agents.
arXiv Detail & Related papers (2023-06-09T02:26:28Z)
Differential Assessment of Black-Box AI Agents [29.98710357871698]
We propose a novel approach to differentially assess black-box AI agents that have drifted from their previously known models. We leverage sparse observations of the drifted agent's current behavior and knowledge of its initial model to generate an active querying policy. Empirical evaluation shows that our approach is much more efficient than re-learning the agent model from scratch.
arXiv Detail & Related papers (2022-03-24T17:48:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.