PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
- URL: http://arxiv.org/abs/2505.01592v1
- Date: Fri, 02 May 2025 21:27:10 GMT
- Title: PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
- Authors: Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür,
- Abstract summary: Existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness.<n>We propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents.<n>Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors.
- Score: 12.052972947563424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.
Related papers
- Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization [61.641777037967366]
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns.<n>Agentic reinforcement learning (RL) has emerged as a promising solution for training such agents in multi-turn settings.<n>We propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities.
arXiv Detail & Related papers (2026-02-11T20:40:43Z) - Constrained Process Maps for Multi-Agent Generative AI Workflows [10.871587311621974]
Large language model (LLM)-based agents are increasingly used in regulated settings such as compliance and due diligence.<n>We introduce a multi-agent system formalized as a finite-horizon Markov Decision Process (MDP) with a directed acyclic structure.<n>Epistemic uncertainty is quantified at the agent level using Monte Carlo estimation, while system-level uncertainty is captured by the MDP's termination in either an automated labeled state or a human-review state.
arXiv Detail & Related papers (2026-02-02T12:32:11Z) - Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems [0.0]
Recent advances in agentic AI have shifted the focus from standalone Large Language Models to integrated systems.<n>We propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment.<n>We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations by conventional metrics.
arXiv Detail & Related papers (2025-12-14T18:17:40Z) - Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios [0.0]
We present a novel multi-agent framework for realistic, explainable human user simulation in interactive scenarios.<n>We employ persona control and task state tracking to mirror human cognitive processes during goal-oriented conversations.
arXiv Detail & Related papers (2025-11-30T20:25:56Z) - AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production [4.031479494871582]
We present Agent, the first evaluation framework designed specifically for post-deployment monitoring and reasoning of agentic pipeline.<n>Agent achieves state-of-the-art results on key metrics, while uncovering critical issues missed in human annotations.
arXiv Detail & Related papers (2025-09-18T05:59:04Z) - Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation [4.08768677009363]
We propose a generalizable, modular framework for evaluating agent task completion independent of the task domain.<n>We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench.<n>Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively.
arXiv Detail & Related papers (2025-08-07T15:39:48Z) - Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding [56.565200973244146]
Agentic Predictor is a lightweight predictor for efficient agentic workflow evaluation.<n>By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations.
arXiv Detail & Related papers (2025-05-26T09:46:50Z) - Agent-Oriented Planning in Multi-Agent Systems [54.429028104022066]
We propose AOP, a novel framework for agent-oriented planning in multi-agent systems.<n>In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy.<n> Extensive experiments demonstrate the advancement of AOP in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.
arXiv Detail & Related papers (2024-10-03T04:07:51Z) - Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface [38.76937539085164]
This paper presents a human-centered efficient agent planning method -- Interactive Speculative Planning.
We aim at enhancing the efficiency of agent planning through both system design and human-AI interaction.
arXiv Detail & Related papers (2024-09-30T16:52:51Z) - Ask-before-Plan: Proactive Language Agents for Real-World Planning [68.08024918064503]
Proactive Agent Planning requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction.
We propose a novel multi-agent framework, Clarification-Execution-Planning (textttCEP), which consists of three agents specialized in clarification, execution, and planning.
arXiv Detail & Related papers (2024-06-18T14:07:28Z) - Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training.
Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z) - Tell Me More! Towards Implicit User Intention Understanding of Language
Model Driven Agents [110.25679611755962]
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions.
We introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries.
We empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires user intentions, and refines them into actionable goals.
arXiv Detail & Related papers (2024-02-14T14:36:30Z) - AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.<n>AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.<n>This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z) - Multi-Agent Determinantal Q-Learning [39.79718674655209]
We propose multi-agent determinantal Q-learning. Q-DPP promotes agents to acquire diverse behavioral models.
We demonstrate that Q-DPP generalizes major solutions including VDN, QMIX, and QTRAN on decentralizable cooperative tasks.
arXiv Detail & Related papers (2020-06-02T09:32:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.