clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
- URL: http://arxiv.org/abs/2505.05445v2
- Date: Mon, 21 Jul 2025 12:42:09 GMT
- Title: clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
- Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
- Abstract summary: clem todd is a framework for systematically evaluating dialogue systems under consistent conditions. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance.
- Score: 18.256529559741075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, focusing either on a single user simulator or a specific system design, which limits the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from the literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and by integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.
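The plug-and-play pairing of user simulators and dialogue systems described in the abstract can be pictured with a minimal sketch: any simulator and any system that expose a common turn-taking interface are combined and run under the same turn budget. The interfaces, class names, and toy agents below are illustrative assumptions, not clem todd's actual API.

```python
from dataclasses import dataclass
from itertools import product
from typing import Protocol

# Hypothetical interface: any user simulator or dialogue system that
# implements `respond` can be benchmarked, regardless of its internals.
class Agent(Protocol):
    name: str
    def respond(self, history: list[str]) -> str: ...

@dataclass
class EchoSystem:
    name: str = "echo-system"
    def respond(self, history: list[str]) -> str:
        return f"You said: {history[-1]}" if history else "How can I help?"

@dataclass
class ScriptedUser:
    name: str = "scripted-user"
    turns: tuple[str, ...] = ("Book a cheap hotel in the centre.", "Thanks, goodbye.")
    def respond(self, history: list[str]) -> str:
        return self.turns[min(len(history) // 2, len(self.turns) - 1)]

def run_dialogue(user: Agent, system: Agent, max_turns: int = 10) -> list[str]:
    """Alternate user and system turns under a shared turn budget."""
    history: list[str] = []
    for _ in range(max_turns):
        history.append(user.respond(history))
        history.append(system.respond(history))
        if "goodbye" in history[-2].lower():
            break
    return history

# Benchmark every simulator/system combination under identical conditions.
simulators, systems = [ScriptedUser()], [EchoSystem()]
for user, system in product(simulators, systems):
    dialogue = run_dialogue(user, system)
    print(user.name, "x", system.name, "->", len(dialogue) // 2, "turns")
```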
Related papers
- ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation [0.0]
ChatChecker is a framework for automated evaluation and testing of complex dialogue systems. It uses large language models (LLMs) to simulate diverse user interactions, identify dialogue breakdowns, and evaluate dialogue quality.
arXiv Detail & Related papers (2025-07-22T17:40:34Z)
- DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling [73.08187964426823]
Large language model (LLM) enabled dialogue systems have become one of the central modes of human-machine interaction. This paper introduces a new research task, Dialogue Element MOdeling. We propose a novel benchmark, DEMO, designed for comprehensive dialogue modeling and assessment.
arXiv Detail & Related papers (2024-12-06T10:01:38Z)
- Many Hands Make Light Work: Task-Oriented Dialogue System with Module-Based Mixture-of-Experts [9.129081545049992]
Task-oriented dialogue systems have greatly benefited from pre-trained language models (PLMs).
We propose the Soft Mixture-of-Expert Task-Oriented Dialogue system (SMETOD).
SMETOD leverages an ensemble of Mixture-of-Experts (MoEs) to excel at subproblems and generate specialized outputs for task-oriented dialogues.
We extensively evaluate our model on three benchmark functionalities: intent prediction, dialogue state tracking, and dialogue response generation.
arXiv Detail & Related papers (2024-05-16T01:02:09Z)
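The soft mixture-of-experts idea mentioned in the SMETOD entry above can be illustrated generically: a gating network weights the outputs of several expert feed-forward networks. The sketch below is a toy with made-up dimensions and expert count, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SoftMoELayer(nn.Module):
    """Toy soft mixture-of-experts layer: every expert processes the input
    and a gating network produces the mixing weights."""
    def __init__(self, d_model: int = 256, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)                    # (batch, seq, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, seq, d_model, n_experts)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)          # weighted combination

hidden = torch.randn(2, 8, 256)        # toy batch of utterance encodings
print(SoftMoELayer()(hidden).shape)    # torch.Size([2, 8, 256])
```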
- Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems [2.788542465279969]
This paper introduces DAUS, a Domain-Aware User Simulator.
We fine-tune DAUS on real examples of task-oriented dialogues.
Results on two relevant benchmarks showcase significant improvements in terms of user goal fulfillment.
arXiv Detail & Related papers (2024-02-20T20:57:47Z)
- Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs).
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z)
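The self-explanation strategy in the entry above (analyse each utterance before performing the task) can be approximated with a simple prompt template. The wording, the `build_self_explanation_prompt` helper, and the `call_llm` placeholder below are illustrative assumptions, not the paper's actual prompt.

```python
# Minimal sketch of a self-explanation style prompt (illustrative wording only).
def build_self_explanation_prompt(dialogue: list[str], task_instruction: str) -> str:
    numbered = "\n".join(f"[{i + 1}] {turn}" for i, turn in enumerate(dialogue))
    return (
        "Here is a dialogue between a user and a system:\n"
        f"{numbered}\n\n"
        "First, explain in one sentence what each utterance is trying to achieve.\n"
        f"Then, using your explanations, {task_instruction}"
    )

dialogue = [
    "User: I need a train to Cambridge on Friday.",
    "System: There are trains every hour; when would you like to leave?",
    "User: After 9 am, please, and book it for two people.",
]
prompt = build_self_explanation_prompt(dialogue, "produce the updated dialogue state as JSON.")
# response = call_llm(prompt)   # `call_llm` is a placeholder for whatever LLM client is available
print(prompt)
```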
- VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue [70.64560638766018]
We propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation.
It defines five core multi-modal dialogue tasks and covers six datasets.
We also present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of the field.
arXiv Detail & Related papers (2023-09-14T02:09:20Z)
- JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning [53.83063435640911]
Dialogue policy learning (DPL) is a crucial component of dialogue modelling.
We introduce a novel framework, JoTR, to generate flexible dialogue actions.
Unlike traditional methods, JoTR formulates a word-level policy that allows for a more dynamic and adaptable dialogue action generation.
arXiv Detail & Related papers (2023-09-01T03:19:53Z)
- Leveraging Explicit Procedural Instructions for Data-Efficient Action Prediction [5.448684866061922]
Task-oriented dialogues often require agents to enact complex, multi-step procedures in order to meet user requests.
Large language models have found success automating these dialogues in constrained environments, but their widespread deployment is limited by the substantial quantities of task-specific data required for training.
This paper presents a data-efficient solution to constructing dialogue systems, leveraging explicit instructions derived from agent guidelines.
arXiv Detail & Related papers (2023-06-06T18:42:08Z)
- Structure Extraction in Task-Oriented Dialogues with Slot Clustering [94.27806592467537]
In task-oriented dialogues, dialogue structure has often been represented as a transition graph among dialogue states.
We propose a simple yet effective approach for structure extraction in task-oriented dialogues.
arXiv Detail & Related papers (2022-02-28T20:18:12Z)
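To make the transition-graph view of dialogue structure from the entry above concrete, here is a generic sketch: embed utterances with TF-IDF, cluster them into latent states, and count transitions between consecutive states. This is a minimal illustration of the general idea with toy data, not the paper's slot-clustering method.

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy dialogues, each a list of utterances.
dialogues = [
    ["i need a hotel in the north", "what price range", "cheap please", "booked, reference ABC"],
    ["find me a cheap hotel", "which area", "the north", "done, reference XYZ"],
]

utterances = [u for d in dialogues for u in d]
features = TfidfVectorizer().fit_transform(utterances)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# Map utterances back to their dialogue and count state-to-state transitions.
transitions = Counter()
idx = 0
for d in dialogues:
    states = labels[idx: idx + len(d)]
    idx += len(d)
    transitions.update(zip(states, states[1:]))

for (src, dst), count in sorted(transitions.items()):
    print(f"state {src} -> state {dst}: {count}")
```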
- Alexa Conversations: An Extensible Data-driven Approach for Building Task-oriented Dialogue Systems [21.98135285833616]
Traditional goal-oriented dialogue systems rely on various components such as natural language understanding, dialogue state tracking, policy learning and response generation.
We present a new approach for building goal-oriented dialogue systems that is scalable, as well as data efficient.
arXiv Detail & Related papers (2021-04-19T07:09:27Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
- Variational Hierarchical Dialog Autoencoder for Dialog State Tracking Data Augmentation [59.174903564894954]
In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs.
We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs.
Experiments on various dialog datasets show that our model improves the downstream dialog trackers' robustness via generative data augmentation.
arXiv Detail & Related papers (2020-01-23T15:34:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.