MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
- URL: http://arxiv.org/abs/2412.15557v3
- Date: Mon, 23 Jun 2025 10:23:35 GMT
- Title: MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
- Authors: Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen
- Abstract summary: Multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored, largely due to the oracle problem in multi-turn testing. We propose MORTAR, a metamorphic multi-turn dialogue testing approach.
- Score: 9.986269647921073
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach that mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises multi-turn testing for dialogue systems and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism gives MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated and does not rely on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR achieves significantly better effectiveness, revealing over 150% more bugs per test case than the single-turn metamorphic testing baseline. The revealed bugs are also of higher quality in terms of diversity, precision, and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches and to assist developers in evaluating dialogue system performance more comprehensively with constrained test resources and budget.
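A minimal sketch of the dialogue-level metamorphic testing idea described in the abstract, written in Python. The `ask` interface, the turn-swap perturbation, and the exact-match metamorphic relation (MR) below are illustrative assumptions for exposition, not MORTAR's actual perturbations, MRs, or matching mechanism.
```python
from typing import Callable, List


def run_dialogue(ask: Callable[[List[dict]], str], questions: List[str]) -> List[str]:
    """Feed questions turn by turn, accumulating chat history, and collect answers."""
    history, answers = [], []
    for question in questions:
        history.append({"role": "user", "content": question})
        answer = ask(history)  # the dialogue system under test
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers


def swap_turns(questions: List[str], i: int, j: int) -> List[str]:
    """Dialogue-level perturbation: swap two questions assumed to be independent."""
    perturbed = list(questions)
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed


def mr_holds(source: List[str], follow_up: List[str], i: int, j: int) -> bool:
    """MR (simplified to exact match): the answers to two independent questions
    should be unchanged after swapping their order in the dialogue."""
    return (source[i].strip() == follow_up[j].strip()
            and source[j].strip() == follow_up[i].strip())


def metamorphic_test(ask, questions: List[str], i: int = 0, j: int = 1) -> bool:
    source_answers = run_dialogue(ask, questions)
    follow_up_answers = run_dialogue(ask, swap_turns(questions, i, j))
    if not mr_holds(source_answers, follow_up_answers, i, j):
        print("MR violated: potential bug revealed without a ground-truth oracle")
        return False
    return True
```
A violated MR flags suspicious behaviour purely from the relation between the source and follow-up dialogues, which is how metamorphic testing sidesteps the need for per-answer ground truth.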
Related papers
- ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation [0.0]
ChatChecker is a framework for automated evaluation and testing of complex dialogue systems. It uses large language models (LLMs) to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality.
arXiv Detail & Related papers (2025-07-22T17:40:34Z) - Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing [19.880543046739252]
Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively.
arXiv Detail & Related papers (2025-07-06T11:38:14Z) - MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation [49.12071445991853]
Large Language Models (LLMs) have been widely adopted in real-world dialogue applications. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues. Experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives.
arXiv Detail & Related papers (2025-05-27T10:28:04Z) - Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations [74.83732294523402]
We introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards.
We also explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes.
Experiments show that dialogue-tuned models outperform traditional methods, with improvements of 9.64% in multi-round reasoning scenarios and 6.18% in accuracy in a noisy environment.
arXiv Detail & Related papers (2025-01-29T18:58:48Z) - MindScope: Exploring cognitive biases in large language models through Multi-Agent Systems [12.245537894266803]
We introduce the 'MindScope' dataset, which distinctively integrates static and dynamic elements.
The static component comprises 5,170 open-ended questions spanning 72 cognitive bias categories.
The dynamic component leverages a rule-based, multi-agent communication framework to facilitate the generation of multi-round dialogues.
In addition, we introduce a multi-agent detection method applicable to a wide range of detection tasks, which integrates Retrieval-Augmented Generation (RAG), competitive debate, and a reinforcement learning-based decision module.
arXiv Detail & Related papers (2024-10-06T11:23:56Z) - Cohesive Conversations: Enhancing Authenticity in Multi-Agent Simulated Dialogues [17.38671584773247]
This paper investigates the quality of multi-agent dialogues in simulations powered by Large Language Models (LLMs).
We propose a novel Screening, Diagnosis, and Regeneration (SDR) framework that detects and corrects utterance errors.
arXiv Detail & Related papers (2024-07-13T14:24:45Z) - Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z) - A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems [12.999001024463453]
This paper aims to give a summary of existing LLMs and approaches for adapting LLMs to downstream tasks.
It elaborates on recent advances in multi-turn dialogue systems, covering both LLM-based open-domain dialogue (ODD) and task-oriented dialogue (TOD) systems.
arXiv Detail & Related papers (2024-02-28T03:16:44Z) - MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues [58.33076950775072]
MT-Bench-101 is designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues.
We construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.
We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives.
arXiv Detail & Related papers (2024-02-22T18:21:59Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models
as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - Are cascade dialogue state tracking models speaking out of turn in
spoken dialogues? [1.786898113631979]
This paper proposes a comprehensive analysis of the errors of state-of-the-art systems in complex settings such as Dialogue State Tracking.
Based on spoken MultiWoz, we identify that errors on non-categorical slots' values are essential to address in order to bridge the gap between spoken and chat-based dialogue systems.
arXiv Detail & Related papers (2023-11-03T08:45:22Z) - Self-Explanation Prompting Improves Dialogue Understanding in Large
Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs)
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z) - PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded
Dialogue Systems [59.1250765143521]
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities.
We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework.
We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
arXiv Detail & Related papers (2023-09-19T08:27:09Z) - Prompting and Evaluating Large Language Models for Proactive Dialogues:
Clarification, Target-guided, and Non-collaboration [72.04629217161656]
This work focuses on three aspects of proactive dialogue systems: clarification, target-guided, and non-collaborative dialogues.
To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme.
arXiv Detail & Related papers (2023-05-23T02:49:35Z) - In-Context Learning for Few-Shot Dialogue State Tracking [55.91832381893181]
We propose an in-context (IC) learning framework for few-shot dialogue state tracking (DST).
A large pre-trained language model (LM) takes a test instance and a few annotated examples as input, and directly decodes the dialogue states without any parameter updates.
This makes the LM more flexible and scalable compared to prior few-shot DST work when adapting to new domains and scenarios.
arXiv Detail & Related papers (2022-03-16T11:58:24Z)