MORTAR: Metamorphic Multi-turn Testing for LLM-based Dialogue Systems
- URL: http://arxiv.org/abs/2412.15557v1
- Date: Fri, 20 Dec 2024 04:31:03 GMT
- Title: MORTAR: Metamorphic Multi-turn Testing for LLM-based Dialogue Systems
- Authors: Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn,
- Abstract summary: We propose MORTAR, a MetamORphic multi-TuRn diAlogue testing appRoach.
MorTAR automates the generation of follow-up question-answer (QA) dialogue test cases.
It detects bugs of multi-turn dialogue systems in a low-cost manner.
- Score: 7.7097144952707435
- License:
- Abstract: With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn scenarios. However, multi-turn dialogue testing remains underexplored, with the Oracle problem in multi-turn testing posing a persistent challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a MetamORphic multi-TuRn diAlogue testing appRoach, which mitigates the test oracle problem in the assessment of LLM-based dialogue systems. MORTAR automates the generation of follow-up question-answer (QA) dialogue test cases with multiple dialogue-level perturbations and metamorphic relations. MORTAR employs a novel knowledge graph-based dialogue information model which effectively generates perturbed dialogue test datasets and detects bugs of multi-turn dialogue systems in a low-cost manner. The proposed approach does not require an LLM as a judge, eliminating potential of any biases in the evaluation step. According to the experiment results on multiple LLM-based dialogue systems and comparisons with single-turn metamorphic testing approaches, MORTAR explores more unique bugs in LLM-based dialogue systems, especially for severe bugs that MORTAR detects up to four times more unique bugs than the most effective existing metamorphic testing approach.
Related papers
- Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations [74.83732294523402]
We introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards.
We also explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes.
Experiments show that dialogue-tuned models outperform traditional methods, with improvements of $9.64%$ in multi-round reasoning scenarios and $6.18%$ in accuracy in a noisy environment.
arXiv Detail & Related papers (2025-01-29T18:58:48Z) - RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues [8.036117602566074]
RAD-Bench is a benchmark designed to evaluate Large Language Models' capabilities in multi-turn dialogues following retrievals.
Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied.
arXiv Detail & Related papers (2024-09-19T08:26:45Z) - Cohesive Conversations: Enhancing Authenticity in Multi-Agent Simulated Dialogues [17.38671584773247]
This paper investigates the quality of multi-agent dialogues in simulations powered by Large Language Models (LLMs)
We propose a novel Screening, Diagnosis, and Regeneration (SDR) framework that detects and corrects utterance errors.
arXiv Detail & Related papers (2024-07-13T14:24:45Z) - MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues [58.33076950775072]
MT-Bench-101 is designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues.
We construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.
We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives.
arXiv Detail & Related papers (2024-02-22T18:21:59Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models
as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - Are cascade dialogue state tracking models speaking out of turn in
spoken dialogues? [1.786898113631979]
This paper proposes a comprehensive analysis of the errors of state of the art systems in complex settings such as Dialogue State Tracking.
Based on spoken MultiWoz, we identify that errors on non-categorical slots' values are essential to address in order to bridge the gap between spoken and chat-based dialogue systems.
arXiv Detail & Related papers (2023-11-03T08:45:22Z) - Self-Explanation Prompting Improves Dialogue Understanding in Large
Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs)
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z) - Prompting and Evaluating Large Language Models for Proactive Dialogues:
Clarification, Target-guided, and Non-collaboration [72.04629217161656]
This work focuses on three aspects of proactive dialogue systems: clarification, target-guided, and non-collaborative dialogues.
To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme.
arXiv Detail & Related papers (2023-05-23T02:49:35Z) - In-Context Learning for Few-Shot Dialogue State Tracking [55.91832381893181]
We propose an in-context (IC) learning framework for few-shot dialogue state tracking (DST)
A large pre-trained language model (LM) takes a test instance and a few annotated examples as input, and directly decodes the dialogue states without any parameter updates.
This makes the LM more flexible and scalable compared to prior few-shot DST work when adapting to new domains and scenarios.
arXiv Detail & Related papers (2022-03-16T11:58:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.