VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures
- URL: http://arxiv.org/abs/2503.12651v1
- Date: Sun, 16 Mar 2025 21:11:18 GMT
- Title: VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures
- Authors: Yoo Yeon Sung, Hannah Kim, Dan Zhang
- Abstract summary: Large language model (LLM) agents in compound AI systems often fail to meet human standards, leading to errors that compromise the system's overall performance.
This paper introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA).
VeriLA systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans.
- Score: 3.075266204492352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks, yet these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans. The framework first defines clear expectations of each agent by curating human-designed agent criteria. Then, it develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures from a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.
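As a rough illustration of the verification loop the abstract describes, the sketch below pairs each agent with human-authored criteria and flags outputs that fall short. All class and function names are hypothetical, and the stand-in `toy_scorer` replaces the paper's trained, human-aligned verifier module.

```python
# Hypothetical sketch of VeriLA-style per-agent verification; names are
# illustrative, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class AgentCriteria:
    """Human-designed expectations for one agent in the compound system."""
    agent_name: str
    criteria: list[str]

@dataclass
class Verdict:
    agent_name: str
    per_criterion: dict[str, float]  # 0.0 (fail) .. 1.0 (pass)

    @property
    def failed(self) -> bool:
        # Surface the agent for human review if any criterion scores low.
        return any(score < 0.5 for score in self.per_criterion.values())

def verify(output: str, spec: AgentCriteria, scorer) -> Verdict:
    """Score one agent's output against each human-defined criterion."""
    return Verdict(spec.agent_name, {c: scorer(output, c) for c in spec.criteria})

# Usage with a placeholder scorer; VeriLA instead trains a verifier on human labels.
toy_scorer = lambda output, criterion: 1.0 if output.strip() else 0.0
spec = AgentCriteria("planner", ["produces a step-by-step plan",
                                 "steps are executable by downstream agents"])
print(verify("1. retrieve docs 2. summarize", spec, toy_scorer).failed)
```

Flagging an agent when any single criterion scores low mirrors the abstract's goal of surfacing granular, per-agent failures for human review.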
Related papers
- Human aversion? Do AI Agents Judge Identity More Harshly Than Performance [0.06554326244334868]
We investigate how AI agents based on large language models assess and integrate human input.
We find that the AI system systematically discounts human advice, penalizing human errors more severely than algorithmic errors.
arXiv Detail & Related papers (2025-03-31T02:05:27Z)
- Human Implicit Preference-Based Policy Fine-tuning for Multi-Agent Reinforcement Learning in USV Swarm [21.24766859509835]
Multi-Agent Reinforcement Learning (MARL) has shown promise in solving complex problems involving cooperation and competition among agents.
We propose a Reinforcement Learning with Human Feedback (RLHF) approach for MARL that resolves credit-assignment challenges through an Agent-Level Feedback system.
Our method effectively refines USV swarm policies, addressing key challenges in multi-agent systems while maintaining fairness and performance consistency.
arXiv Detail & Related papers (2025-03-05T14:33:18Z)
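The entry above names an Agent-Level Feedback system as the fix for credit assignment; a toy contrast, with hypothetical agent names and scores, of a shared team reward versus per-agent human feedback:

```python
# Toy contrast between one shared team reward and per-agent human feedback;
# agent names and feedback values are hypothetical.
feedback = {"usv_1": 1.0, "usv_2": -0.5, "usv_3": 0.2}

def team_reward(fb: dict[str, float]) -> float:
    # Shared scalar: every agent's update sees the same signal,
    # so credit for usv_1 and blame for usv_2 are washed out.
    return sum(fb.values()) / len(fb)

def agent_level_rewards(fb: dict[str, float]) -> dict[str, float]:
    # Agent-level feedback: each policy update sees only its own credit.
    return dict(fb)

print(team_reward(feedback), agent_level_rewards(feedback))
```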
- Causal Responsibility Attribution for Human-AI Collaboration [62.474732677086855]
This paper presents a causal framework using Structural Causal Models (SCMs) to systematically attribute responsibility in human-AI systems.
Two case studies illustrate the framework's adaptability in diverse human-AI collaboration scenarios.
arXiv Detail & Related papers (2024-11-05T17:17:45Z)
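As a toy illustration of SCM-based attribution, the snippet below probes responsibility with a but-for counterfactual; the variables and the failure mechanism are illustrative, not the paper's exact definitions.

```python
# Toy structural causal model with a but-for responsibility probe; the
# variables and the failure mechanism are illustrative, not the paper's.
def outcome(human_accepts: bool, ai_recommends_bad: bool) -> bool:
    """True means the joint human-AI system produced a failure."""
    return human_accepts and ai_recommends_bad

def counterfactually_responsible(actor: str, human_accepts: bool,
                                 ai_recommends_bad: bool) -> bool:
    actual = outcome(human_accepts, ai_recommends_bad)
    if actor == "human":
        flipped = outcome(not human_accepts, ai_recommends_bad)
    else:
        flipped = outcome(human_accepts, not ai_recommends_bad)
    return actual and not flipped  # flipping this actor's act averts the failure

print(counterfactually_responsible("human", True, True))  # True: blame is shared
```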
- Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
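A minimal sketch of the judging pattern described above: a judge agent scores each intermediate step of a worker agent's trajectory rather than only the final answer. `call_judge_llm` is a hypothetical stand-in for an LLM backend, not the paper's pipeline.

```python
# Minimal judge-the-trajectory sketch; `call_judge_llm` is a hypothetical
# placeholder for whatever LLM backend the judge agent runs on.
def call_judge_llm(prompt: str) -> str:
    return "PASS"  # a real implementation would query an LLM here

def judge_trajectory(task: str, steps: list[str]) -> list[tuple[str, bool]]:
    """Return a PASS/FAIL judgment for every intermediate step, not just the end."""
    verdicts = []
    for i, step in enumerate(steps):
        prompt = (f"Task: {task}\n"
                  f"Steps so far: {steps[:i]}\n"
                  f"Current step: {step}\n"
                  "Does this step make valid progress? Answer PASS or FAIL.")
        verdicts.append((step, call_judge_llm(prompt).strip() == "PASS"))
    return verdicts

print(judge_trajectory("build a TODO app", ["scaffold project", "write tests"]))
```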
- Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [117.94654815220404]
G"odel Agent is a self-evolving framework inspired by the G"odel machine.<n>G"odel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
arXiv Detail & Related papers (2024-10-06T10:49:40Z)
- Large Language Model-based Human-Agent Collaboration for Complex Task Solving [94.3914058341565]
We introduce the problem of Large Language Models (LLMs)-based human-agent collaboration for complex task-solving.
We propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC.
This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process.
arXiv Detail & Related papers (2024-02-20T11:03:36Z)
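A rough sketch of the decision that policy model makes, with a logistic policy standing in for the RL-trained one; the feature and weight names are hypothetical.

```python
# Logistic stand-in for an intervention policy like ReHAC's; feature and
# weight values are hypothetical, and the paper trains its policy with RL.
import math
import random

def intervene_probability(features: list[float], weights: list[float]) -> float:
    """Probability that a human should take over at this stage of the task."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def act(features: list[float], weights: list[float]) -> str:
    p = intervene_probability(features, weights)
    return "human" if random.random() < p else "agent"

# e.g. features = [agent confidence, task difficulty, steps remaining]
print(act([0.2, 0.9, 0.5], [-2.0, 3.0, 0.5]))
```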
- How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
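A minimal sketch of the referee-debate pattern, assuming a hypothetical `persona_llm` backend: judge personas comment over several rounds, each seeing earlier comments, and their scores are averaged.

```python
# Round-robin referee debate sketch; `persona_llm` is a hypothetical
# placeholder returning a comment and a 1-10 score for each judge persona.
def persona_llm(persona: str, transcript: str) -> tuple[str, float]:
    return f"{persona}: the response looks coherent.", 7.0

def debate_score(response: str, personas: list[str], rounds: int = 2) -> float:
    transcript = f"Response under review:\n{response}\n"
    scores = []
    for _ in range(rounds):
        for persona in personas:
            comment, score = persona_llm(persona, transcript)
            transcript += comment + "\n"  # later referees see earlier comments
            scores.append(score)
    return sum(scores) / len(scores)  # simple average aggregation

print(debate_score("Paris is the capital of France.", ["critic", "general public"]))
```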
- Compensating for Sensing Failures via Delegation in Human-AI Hybrid Systems [0.0]
We consider the hybrid human-AI teaming case where a managing agent is tasked with identifying when to perform a delegation assignment.
We model how the environmental context can contribute to, or exacerbate, the sensing deficiencies.
We demonstrate how a Reinforcement Learning (RL) manager can correct the context-delegation association.
arXiv Detail & Related papers (2023-03-02T14:27:01Z)
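An illustrative stand-in for such a manager, using tabular Q-learning over hypothetical (context, delegate) pairs; the paper's setting is richer than this bandit-style update.

```python
# Tabular Q-learning stand-in for a delegating manager; context and action
# names are hypothetical, and the update is one-step for brevity.
import random
from collections import defaultdict

ACTIONS = ["delegate_to_human", "delegate_to_ai"]
Q = defaultdict(float)
ALPHA, EPSILON = 0.1, 0.2

def choose(context: str) -> str:
    if random.random() < EPSILON:  # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(context, a)])  # exploit

def update(context: str, action: str, reward: float) -> None:
    Q[(context, action)] += ALPHA * (reward - Q[(context, action)])

# e.g. fog degrades the AI's sensors, so delegating to the human gets rewarded
update("fog", "delegate_to_human", 1.0)
print(choose("fog"))
```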
- A Cognitive Framework for Delegation Between Error-Prone AI and Human Agents [0.0]
We investigate the use of cognitively inspired models of behavior to predict the behavior of both human and AI agents.
The predicted behavior is used to delegate control between humans and AI agents through the use of an intermediary entity.
arXiv Detail & Related papers (2022-04-06T15:15:21Z)
- Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents.
In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently.
We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.