ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation
- URL: http://arxiv.org/abs/2507.16792v1
- Date: Tue, 22 Jul 2025 17:40:34 GMT
- Title: ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation
- Authors: Roman Mayr, Michel Schimpf, Thomas Bohné
- Abstract summary: ChatChecker is a framework for automated evaluation and testing of complex dialogue systems. It uses large language models (LLMs) to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While modern dialogue systems heavily rely on large language models (LLMs), their implementation often goes beyond pure LLM interaction. Developers integrate multiple LLMs, external tools, and databases. Therefore, assessment of the underlying LLM alone does not suffice, and the dialogue systems must be tested and evaluated as a whole. However, this remains a major challenge. With most previous work focusing on turn-level analysis, less attention has been paid to integrated dialogue-level quality assurance. To address this, we present ChatChecker, a framework for automated evaluation and testing of complex dialogue systems. ChatChecker uses LLMs to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality. Compared to previous approaches, our design reduces setup effort and is generalizable, as it does not require reference dialogues and is decoupled from the implementation of the target dialogue system. We improve breakdown detection performance over a prior LLM-based approach by including an error taxonomy in the prompt. Additionally, we propose a novel non-cooperative user simulator based on challenging personas that uncovers weaknesses in target dialogue systems more effectively. Through this, ChatChecker contributes to thorough and scalable testing. This enables both researchers and practitioners to accelerate the development of robust dialogue systems.
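The listing does not include ChatChecker's implementation, so the sketch below only illustrates the two ideas named in the abstract: a persona-conditioned, non-cooperative user simulator and taxonomy-guided breakdown detection. It assumes a generic `llm(prompt) -> str` completion function, an illustrative persona, and placeholder breakdown types rather than the paper's actual taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed generic completion interface; any chat/completion backend can be wrapped.
LLM = Callable[[str], str]

# Illustrative breakdown types only; the paper's actual error taxonomy is richer.
BREAKDOWN_TAXONOMY = [
    "ignore request", "contradiction", "repetition",
    "topic transition error", "lack of information",
]

@dataclass
class Persona:
    name: str
    traits: str  # e.g. "impatient, vague, switches topics, tests system limits"

def simulate_user_turn(llm: LLM, persona: Persona, history: List[str]) -> str:
    """Generate the next (possibly non-cooperative) user utterance."""
    prompt = (
        f"Role-play a challenging user named {persona.name}: {persona.traits}.\n"
        "Write the next user message only.\n\n"
        "Dialogue so far:\n" + "\n".join(history) + "\nUser:"
    )
    return llm(prompt).strip()

def detect_breakdown(llm: LLM, history: List[str], system_reply: str) -> str:
    """Ask a judge LLM to label the latest system reply with a taxonomy entry."""
    prompt = (
        "Given the dialogue and the latest system reply, answer 'none' or one of "
        f"these breakdown types: {', '.join(BREAKDOWN_TAXONOMY)}.\n\n"
        "Dialogue:\n" + "\n".join(history) +
        f"\nSystem: {system_reply}\nLabel:"
    )
    return llm(prompt).strip().lower()
```

Because the simulator and the judge only exchange text with the target system, a loop over these two functions stays decoupled from the target's implementation, which is the property the abstract emphasizes.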
Related papers
- clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations [18.256529559741075]
clem:todd is a framework for systematically evaluating dialogue systems under consistent conditions. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance.
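clem:todd's concrete API is not reproduced in this listing; the fragment below only sketches the plug-and-play idea of a uniform interface plus a shared goal set and metric, with hypothetical names (`DialogueSystem`, `run_benchmark`).

```python
from typing import Callable, Dict, List, Protocol

class DialogueSystem(Protocol):
    """Minimal plug-and-play contract: any system under test exposes respond()."""
    def respond(self, history: List[str]) -> str: ...

def run_benchmark(systems: Dict[str, DialogueSystem],
                  user_goals: List[str],
                  metric: Callable[[str, str], float]) -> Dict[str, float]:
    """Score every system on the same goals with the same metric,
    so comparisons are made under uniform conditions."""
    results = {}
    for name, system in systems.items():
        scores = [metric(goal, system.respond([f"User: {goal}"])) for goal in user_goals]
        results[name] = sum(scores) / len(scores)
    return results
```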
arXiv Detail & Related papers (2025-05-08T17:36:36Z)
- Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression [9.005722141359675]
This study prepared reward models corresponding to 12 metrics related to the impression of the entire dialogue for evaluating dialogue responses. We tuned our dialogue models using the reward-model signals as feedback to improve the impression made by the system.
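How the twelve reward signals are combined is not spelled out in this summary; the sketch below shows one plausible aggregation (a weighted sum producing a single feedback scalar), with all names hypothetical.

```python
from typing import Callable, Dict, List

# Assumed: one trained scorer per impression metric, mapping a dialogue to a score.
RewardModel = Callable[[List[str]], float]

def impression_reward(dialogue: List[str],
                      reward_models: Dict[str, RewardModel],
                      weights: Dict[str, float]) -> float:
    """Combine per-metric reward-model scores into one scalar that can serve
    as the feedback signal when tuning the dialogue model."""
    return sum(weights.get(name, 1.0) * model(dialogue)
               for name, model in reward_models.items())
```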
arXiv Detail & Related papers (2025-01-22T08:14:51Z)
- MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems [9.986269647921073]
Multi-turn interaction is the common real-world usage of dialogue systems, yet multi-turn testing remains difficult, largely due to the oracle problem. We propose MORTAR, a metamorphic multi-turn dialogue testing approach.
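MORTAR defines its own perturbations and relations; the sketch below only shows the general shape of a metamorphic test for a multi-turn system, with `perturb` and `consistent` left as hypothetical callables.

```python
from typing import Callable, List

Dialogue = List[str]

def metamorphic_check(system: Callable[[Dialogue], str],
                      dialogue: Dialogue,
                      perturb: Callable[[Dialogue], Dialogue],
                      consistent: Callable[[str, str], bool]) -> bool:
    """A metamorphic relation needs no ground-truth answer (no oracle): a
    semantics-preserving perturbation of the dialogue, e.g. paraphrasing an
    earlier user turn, should leave the system's reply consistent."""
    original = system(dialogue)
    perturbed = system(perturb(dialogue))
    return consistent(original, perturbed)
```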
arXiv Detail & Related papers (2024-12-20T04:31:03Z)
- DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling [73.08187964426823]
LLM-enabled dialogue systems have become one of the central modes of human-machine interaction. This paper introduces a new research task, Dialogue Element MOdeling. We propose a novel benchmark, DEMO, designed for comprehensive dialogue modeling and assessment.
arXiv Detail & Related papers (2024-12-06T10:01:38Z)
- Are cascade dialogue state tracking models speaking out of turn in spoken dialogues? [1.786898113631979]
This paper presents a comprehensive analysis of the errors made by state-of-the-art systems in complex settings such as Dialogue State Tracking.
Based on spoken MultiWOZ, we identify that errors on non-categorical slot values are essential to address in order to bridge the gap between spoken and chat-based dialogue systems.
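The paper's exact analysis pipeline is not shown here; the snippet below is a minimal illustration of the distinction the summary highlights, splitting dialogue-state-tracking errors by slot type.

```python
from typing import Dict, Set

def slot_error_breakdown(predicted: Dict[str, str],
                         gold: Dict[str, str],
                         categorical_slots: Set[str]) -> Dict[str, int]:
    """Count dialogue-state errors separately for categorical slots (closed
    value sets) and non-categorical slots (free-form values such as names or
    times, which suffer most from speech transcription)."""
    errors = {"categorical": 0, "non_categorical": 0}
    for slot, gold_value in gold.items():
        if predicted.get(slot) != gold_value:
            kind = "categorical" if slot in categorical_slots else "non_categorical"
            errors[kind] += 1
    return errors
```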
arXiv Detail & Related papers (2023-11-03T08:45:22Z)
- DialogBench: Evaluating LLMs as Human-like Dialogue Systems [16.997134341787486]
Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning.
In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks.
We show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems.
arXiv Detail & Related papers (2023-11-03T02:59:56Z)
- Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs)
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
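The exact prompt wording belongs to the paper; the helper below is only an illustrative reconstruction of the described two-step structure (explain each utterance, then solve the task).

```python
from typing import List

def self_explanation_prompt(dialogue_turns: List[str], task_instruction: str) -> str:
    """Two-stage, task-agnostic prompt: explain every utterance first, then
    perform the downstream task using those explanations."""
    numbered = "\n".join(f"{i + 1}. {turn}" for i, turn in enumerate(dialogue_turns))
    return (
        "Dialogue:\n" + numbered + "\n\n"
        "Step 1: Briefly explain the intent of each utterance above.\n"
        f"Step 2: Using your explanations, {task_instruction}"
    )
```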
arXiv Detail & Related papers (2023-09-22T15:41:34Z)
- Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration [72.04629217161656]
This work focuses on three aspects of proactive dialogue systems: clarification, target-guided, and non-collaborative dialogues.
To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme.
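The scheme's actual prompts are defined in the paper; this is a hedged sketch of the idea of making the model deliberate over candidate dialogue acts before answering, with illustrative action names.

```python
from typing import List, Sequence

def proactive_cot_prompt(history: List[str],
                         actions: Sequence[str] = ("clarify", "answer directly",
                                                   "steer toward the target topic",
                                                   "decline a non-collaborative request")) -> str:
    """Prompt the model to reason about which proactive act to take
    before generating its response."""
    return (
        "Conversation so far:\n" + "\n".join(history) + "\n\n"
        "First, think step by step about which action to take next: "
        + ", ".join(actions) + ".\n"
        "Then write the response that carries out the chosen action.\n"
        "Thought:"
    )
```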
arXiv Detail & Related papers (2023-05-23T02:49:35Z)
- Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thought (Cue-CoT) prompting approach to provide more personalized and engaging responses.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate that our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
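Cue-CoT's concrete prompts are given in the paper; the sketch below only captures the described cue-then-respond structure, assuming a generic `llm(prompt) -> str` callable.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed generic completion interface

def cue_cot_respond(llm: LLM, dialogue_history: List[str]) -> str:
    """Two-step flow: infer linguistic cues about the user from the dialogue,
    then condition the final response on those cues."""
    context = "\n".join(dialogue_history)
    cues = llm(
        "Dialogue:\n" + context +
        "\n\nDescribe the user's likely emotional state, needs, and personality "
        "as revealed by their utterances:"
    )
    return llm(
        "Dialogue:\n" + context +
        f"\n\nInferred user cues: {cues}\n"
        "Write a helpful, personalized response that takes these cues into account:"
    )
```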
arXiv Detail & Related papers (2023-05-19T16:27:43Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis of different types of dialog systems, composed of different modules, across a range of settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)