ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System
- URL: http://arxiv.org/abs/2601.11854v1
- Date: Sat, 17 Jan 2026 00:53:43 GMT
- Title: ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System
- Authors: Yifei Zhang, Hooshang Nayyeri, Rinat Khaziev, Emine Yilmaz, Gokhan Tur, Dilek Hakkani-Tür, Hari Thadakamalla
- Abstract summary: Recent advances in task-oriented dialogue (TOD) systems have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. We introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. We propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation.
- Score: 27.78128349257987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.
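The abstract does not specify ATOD-Eval's concrete interface, so the following is only a hypothetical sketch of how fine-grained metrics might roll up into the three reported axes (task completion, agentic capability, response quality); every metric name below is invented for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical sketch only: ATOD-Eval's real metric names and interfaces
# are not given in the abstract.
@dataclass
class DialogueScores:
    # Fine-grained per-metric scores in [0, 1], grouped by evaluation axis.
    task_completion: dict = field(default_factory=dict)
    agentic_capability: dict = field(default_factory=dict)
    response_quality: dict = field(default_factory=dict)

def aggregate(scores: DialogueScores) -> dict:
    """Collapse fine-grained metrics into one score per evaluation axis."""
    return {
        "task_completion": mean(scores.task_completion.values()),
        "agentic_capability": mean(scores.agentic_capability.values()),
        "response_quality": mean(scores.response_quality.values()),
    }

example = DialogueScores(
    task_completion={"goal_success": 1.0, "api_args_correct": 0.8},
    agentic_capability={"dependency_order": 1.0, "proactivity": 0.5, "memory_recall": 0.7},
    response_quality={"fluency": 0.9, "groundedness": 0.85},
)
print(aggregate(example))
# {'task_completion': 0.9, 'agentic_capability': 0.7333..., 'response_quality': 0.875}
```

Reporting all three axis scores separately, rather than one aggregate, is one plausible reading of "holistic" evaluation here.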
Related papers
- Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems [0.0]
Recent advances in agentic AI have shifted the focus from standalone Large Language Models to integrated systems. We propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations missed by conventional metrics.
arXiv Detail & Related papers (2025-12-14T18:17:40Z)
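The entry above names four pillars but not their concrete checks; as a loose illustration (all probe names and trace fields are hypothetical), a per-pillar scorecard over one execution trace might look like:

```python
from typing import Callable

# All probes and trace fields are hypothetical; the framework's real
# checks per pillar are not specified in the abstract.
PILLARS: dict[str, list[Callable[[dict], float]]] = {
    "LLM": [lambda t: float(t["answer_correct"])],
    "Memory": [lambda t: t["facts_recalled"] / max(t["facts_stored"], 1)],
    "Tools": [lambda t: t["tool_calls_valid"] / max(t["tool_calls_total"], 1)],
    "Environment": [lambda t: float(t["side_effects_ok"])],
}

def assess(trace: dict) -> dict[str, float]:
    """Average every probe within each pillar over one agent execution trace."""
    return {name: sum(p(trace) for p in probes) / len(probes)
            for name, probes in PILLARS.items()}

trace = {"answer_correct": True, "facts_recalled": 3, "facts_stored": 4,
         "tool_calls_valid": 5, "tool_calls_total": 6, "side_effects_ok": True}
print(assess(trace))  # {'LLM': 1.0, 'Memory': 0.75, 'Tools': 0.833..., 'Environment': 1.0}
```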
- Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation [47.85891728056131]
PRDBench is a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. We employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests.
arXiv Detail & Related papers (2025-10-28T12:26:45Z)
- Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models [0.0]
We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems. Our method segments conversations by user goals and evaluates success using all relevant turns. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent system.
arXiv Detail & Related papers (2025-10-04T06:22:47Z)
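A toy rendering of the goal-segmentation idea from the entry above, assuming turns are already labeled with goal ids; the paper's actual judgments come from teacher models, not hand-written flags:

```python
from collections import defaultdict

# Toy illustration: goal labels and success flags are hand-written here;
# the paper derives such judgments with teacher models.
turns = [
    {"goal": "book_flight", "success": True},
    {"goal": "book_flight", "success": True},
    {"goal": "add_bag",     "success": False},
    {"goal": "book_flight", "success": True},
    {"goal": "add_bag",     "success": True},
]

# Segment the conversation by user goal, keeping every relevant turn.
segments = defaultdict(list)
for turn in turns:
    segments[turn["goal"]].append(turn["success"])

# Score each goal over all of its turns, not just the final one.
goal_success = {goal: sum(flags) / len(flags) for goal, flags in segments.items()}
print(goal_success)  # {'book_flight': 1.0, 'add_bag': 0.5}
```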
- A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports [24.09178055088843]
Deep Research Agents (DRAs) exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness.
arXiv Detail & Related papers (2025-10-02T16:40:02Z)
- TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons [11.961955016373379]
TD-EVAL (Turn and Dialogue-level Evaluation) is a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. We show that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. It also exhibits better alignment with human judgments than traditional and LLM-based metrics.
arXiv Detail & Related papers (2025-04-28T16:57:17Z)
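A minimal sketch of TD-EVAL's two-step shape, with a stub in place of the LLM judge; the function names and the stub's scoring rule are invented here, not taken from the paper:

```python
# Stub judge standing in for an LLM: scores any text by length, capped at 1.
# Invented for illustration; TD-EVAL uses LLM judges with task-specific rubrics.
def stub_judge(text) -> float:
    return min(len(str(text)) / 100, 1.0)

def turn_scores(dialogue: list[str], judge) -> list[float]:
    """Step 1: fine-grained scoring of each system turn in isolation."""
    return [judge(turn) for turn in dialogue]

def dialogue_preference(dialogue_a, dialogue_b, judge) -> str:
    """Step 2: holistic comparison of two complete dialogues."""
    return "A" if judge(dialogue_a) >= judge(dialogue_b) else "B"

d_a = ["Sure, I booked your table for 7pm.", "Anything else I can help with?"]
d_b = ["I cannot help with that."]
print(turn_scores(d_a, stub_judge))               # per-turn scores
print(dialogue_preference(d_a, d_b, stub_judge))  # 'A'
```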
- TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains [19.492393243160244]
Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains. Existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets. We propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
- CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z)
- InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems [60.53276524369498]
Large language models (LLMs) have been used for diverse tasks in natural language processing (NLP).
We present InstructTODS, a novel framework for zero-shot end-to-end task-oriented dialogue systems.
InstructTODS generates a proxy belief state that seamlessly translates user intentions into dynamic queries.
arXiv Detail & Related papers (2023-10-13T06:36:26Z)
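The proxy-belief-state idea above can be pictured as slot-value pairs rendered into a database query; the slot schema and SQL-like query format below are invented for illustration, since InstructTODS produces the state with an LLM:

```python
# Illustrative only: InstructTODS has an LLM produce the proxy belief
# state; the slot schema and SQL-like query format here are invented.
def belief_state_to_query(domain: str, state: dict) -> str:
    """Render a slot-value belief state as a simple KB query string."""
    constraints = " AND ".join(f"{slot}='{value}'" for slot, value in state.items())
    return f"SELECT * FROM {domain} WHERE {constraints}"

proxy_state = {"area": "centre", "food": "italian", "pricerange": "cheap"}
print(belief_state_to_query("restaurant", proxy_state))
# SELECT * FROM restaurant WHERE area='centre' AND food='italian' AND pricerange='cheap'
```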
- Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals [76.69419538047813]
The ProToD approach anticipates future dialogue actions and incorporates a goal-oriented reward signal to enhance ToD systems.
We present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations.
Empirical experiments conducted on the MultiWOZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data.
arXiv Detail & Related papers (2023-09-16T10:56:00Z)
- VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue [70.64560638766018]
We propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation.
It defines five core multi-modal dialogue tasks and covers six datasets.
We also present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of
arXiv Detail & Related papers (2023-09-14T02:09:20Z)
- JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents [59.091663077007304]
We propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. Our framework achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC). Our model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
arXiv Detail & Related papers (2022-08-28T18:30:46Z)