The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
- URL: http://arxiv.org/abs/2506.12266v1
- Date: Fri, 13 Jun 2025 22:36:41 GMT
- Title: The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
- Authors: Avinash Baidya, Kamalika Das, Xiang Gao
- Abstract summary: This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior.
- Score: 8.581146564012856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving the performance gap remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior, with low F1 scores for dialog acts (0.464), excessive and often misaligned tool usage with an F1 score of 0.139, and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM-based TODS in handling complex tasks.
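The abstract reports F1 scores for how closely the agent's dialog acts and tool calls match those of human experts. As a rough illustration only, the sketch below computes a multiset-overlap F1 between two label sequences. The function name `f1_score`, the label names, and the choice of multiset matching are all assumptions made for this example; the abstract does not specify how the paper computes its scores.

```python
from collections import Counter


def f1_score(predicted, reference):
    """F1 between two label multisets (assumed matching scheme, not the paper's)."""
    pred, ref = Counter(predicted), Counter(reference)
    # Multiset intersection: each label counts up to its smaller frequency.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical dialog-act labels, not data from the paper.
human_acts = ["greet", "request_info", "confirm", "close"]
agent_acts = ["greet", "request_info", "request_info", "inform", "close"]

print(round(f1_score(agent_acts, human_acts), 3))  # 0.667
```

Here three of the agent's five acts match the human's four, giving precision 0.6, recall 0.75, and F1 of about 0.667. The same function could be applied to sequences of tool names to obtain a tool-usage score under the same assumption.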
Related papers
- Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications [21.698669254520475]
This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset.
We show that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks.
We propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review.
arXiv Detail & Related papers (2025-07-01T08:04:58Z)
- ORMind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research [53.736407871322314]
We introduce ORMind, a cognitive-inspired framework that enhances optimization through counterfactual reasoning.
Our approach emulates human cognition, implementing an end-to-end workflow that transforms requirements into mathematical models and executable code.
It is currently being tested internally in Lenovo's AI Assistant, with plans to enhance optimization capabilities for both business and consumer customers.
arXiv Detail & Related papers (2025-06-02T05:11:21Z)
- How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior [49.62361184944454]
Memory is a critical component in large language model (LLM)-based agents.
We study how memory management choices impact the LLM agents' behavior, especially their long-term performance.
arXiv Detail & Related papers (2025-05-21T22:35:01Z)
- Evaluations at Work: Measuring the Capabilities of GenAI in Use [28.124088786766965]
Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration.
We present an evaluation framework that decomposes real-world tasks into interdependent subtasks.
arXiv Detail & Related papers (2025-05-15T23:06:23Z)
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.
Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.
We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
- Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training.
Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
- Learning to Use Tools via Cooperative and Interactive Agents [58.77710337157665]
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility.
We propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately.
Our experiments on three datasets show that the LLMs, when equipped with ConAgents, outperform baselines with substantial improvement.
arXiv Detail & Related papers (2024-03-05T15:08:16Z)
- OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following [38.99303334457817]
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions.
Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in EIF.
We introduce OPEx, a comprehensive framework that delineates the core components essential for solving EIF tasks: Observer, Planner, and Executor.
arXiv Detail & Related papers (2024-03-05T14:53:53Z)
- Large Language Model-based Human-Agent Collaboration for Complex Task Solving [94.3914058341565]
We introduce the problem of Large Language Models (LLMs)-based human-agent collaboration for complex task-solving.
We propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC.
This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process.
arXiv Detail & Related papers (2024-02-20T11:03:36Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.
AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.
This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- Task-Agnostic Continual Reinforcement Learning: Gaining Insights and Overcoming Challenges [27.474011433615317]
Continual learning (CL) enables the development of models and agents that learn from a sequence of tasks.
We investigate the factors that contribute to the performance differences between task-agnostic CL and multi-task (MTL) agents.
arXiv Detail & Related papers (2022-05-28T17:59:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.