Towards Multi-Platform Mutation Testing of Task-based Chatbots
- URL: http://arxiv.org/abs/2509.01389v1
- Date: Mon, 01 Sep 2025 11:36:06 GMT
- Title: Towards Multi-Platform Mutation Testing of Task-based Chatbots
- Authors: Diego Clerissi, Elena Masserini, Daniela Micucci, Leonardo Mariani,
- Abstract summary: MUTABOT is a mutation testing approach for injecting faults in conversations. We present our extension of MUTABOT to multiple platforms (Dialogflow and Rasa) and show how mutation testing can be used to reveal weaknesses in test suites generated by the state-of-the-art Botium test generator.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chatbots, also known as conversational agents, have become ubiquitous, offering services for a multitude of domains. Unlike general-purpose chatbots, task-based chatbots are software designed to prioritize the completion of tasks in the domain they handle (e.g., flight booking). Given the growing popularity of chatbots, testing techniques that can generate full conversations as test cases have emerged. Still, thoroughly testing all the possible conversational scenarios implemented by a task-based chatbot is challenging, and incorrect behaviors may remain unnoticed. To address this challenge, we proposed MUTABOT, a mutation testing approach for injecting faults in conversations and producing faulty chatbots that emulate defects affecting the conversational aspects. In this paper, we present our extension of MUTABOT to multiple platforms (Dialogflow and Rasa), and present experiments that show how mutation testing can be used to reveal weaknesses in test suites generated by the state-of-the-art Botium test generator.
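To make the core idea concrete, the following is a minimal, illustrative sketch of mutation testing applied to a toy chatbot, not the actual MUTABOT implementation: a hypothetical "intent swap" operator injects a conversational fault (two intents exchange their training phrases), and a test suite "kills" the mutant when some test observes a changed answer.

```python
# Illustrative sketch only (assumed toy model, not MUTABOT's real operators):
# mutate a chatbot's intent definitions and check whether tests kill the mutants.
import copy

# Hypothetical chatbot model: intent name -> training phrases.
chatbot = {
    "book_flight": ["book a flight", "I need a plane ticket"],
    "cancel_flight": ["cancel my flight", "drop my booking"],
}

def swap_intent_mutants(bot):
    """Yield mutants in which the phrases of two intents are swapped,
    emulating an intent-confusion defect."""
    names = sorted(bot)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            mutant = copy.deepcopy(bot)
            mutant[a], mutant[b] = mutant[b], mutant[a]
            yield (a, b), mutant

def classify(bot, utterance):
    """Naive intent matcher: first intent with an exact training phrase."""
    for intent, phrases in bot.items():
        if utterance in phrases:
            return intent
    return None

# A mutant is "killed" when at least one test detects the injected fault.
tests = [("book a flight", "book_flight"), ("cancel my flight", "cancel_flight")]

results = {}
for pair, mutant in swap_intent_mutants(chatbot):
    killed = any(classify(mutant, u) != expected for u, expected in tests)
    results[pair] = "killed" if killed else "survived"
    print(pair, results[pair])
```

A surviving mutant would indicate a conversational fault the test suite cannot detect, which is exactly the kind of weakness the paper's experiments expose in generated test suites.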
Related papers
- Automated Testing of Task-based Chatbots: How Far Are We? [5.64612424709862]
Task-based chatbots are software, typically embedded in real-world applications, that assist users in completing tasks through a conversational interface.
In this paper, we evaluate the effectiveness of state-of-the-art testing techniques on a curated selection of task-based chatbots from GitHub.
arXiv Detail & Related papers (2026-02-13T16:32:50Z) - SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness [4.896226014796392]
We introduce SafeChat, a general architecture for building safe and trustworthy chatbots.
Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded in and traceable to approved sources (provenance); (b) usability, with automatic extractive summarization of long responses, traceable to their sources; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices.
arXiv Detail & Related papers (2025-04-08T19:16:43Z) - Test Case Generation for Dialogflow Task-Based Chatbots [3.488620810035772]
Test Generator (CTG) is an automated testing technique designed for task-based chatbots.
We conducted an experiment comparing CTG with the state-of-the-art BOTIUM and CHARM tools.
CTG outperformed the competitors in terms of robustness and effectiveness.
arXiv Detail & Related papers (2025-03-07T16:39:27Z) - Measuring and Controlling Instruction (In)Stability in Language Model Dialogs [72.38330196290119]
System-prompting is a tool for customizing language-model chatbots, enabling them to follow a specific instruction.
We propose a benchmark to test the assumption, evaluating instruction stability via self-chats.
We reveal significant instruction drift within eight rounds of conversation.
We propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
arXiv Detail & Related papers (2024-02-13T20:10:29Z) - MutaBot: A Mutation Testing Approach for Chatbots [3.811067614153878]
MutaBot addresses mutations at multiple levels, including conversational flows, intents, and contexts.
We assess the tool with three Dialogflow chatbots and test cases generated with Botium, revealing weaknesses in the test suites.
arXiv Detail & Related papers (2024-01-18T20:38:27Z) - Evaluating Chatbots to Promote Users' Trust -- Practices and Open Problems [11.427175278545517]
This paper reviews current practices for testing chatbots.
It identifies gaps as open problems in pursuit of user trust.
It outlines a path forward to mitigate issues of trust related to service or product performance, user satisfaction and long-term unintended consequences for society.
arXiv Detail & Related papers (2023-09-09T22:40:30Z) - Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard [68.8204255655161]
We use 30 questions that are clear, without any ambiguities, fully described in plain text only, and have a unique, well-defined correct answer.
The answers are recorded and discussed, highlighting their strengths and weaknesses.
It was found that ChatGPT-4 outperforms ChatGPT-3.5 in both sets of questions.
arXiv Detail & Related papers (2023-05-30T11:18:05Z) - Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data [101.63682141248069]
Chat models, such as ChatGPT, have shown impressive capabilities and have been rapidly adopted across numerous domains.
We propose a pipeline that can automatically generate a high-quality multi-turn chat corpus by leveraging ChatGPT.
We employ parameter-efficient tuning to enhance LLaMA, an open-source large language model.
arXiv Detail & Related papers (2023-04-03T17:59:09Z) - BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome since test sentences are generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z) - CheerBots: Chatbots toward Empathy and Emotion using Reinforcement Learning [60.348822346249854]
This study presents a framework in which empathetic chatbots understand users' implied feelings and reply empathetically over multiple dialogue turns.
We call these chatbots CheerBots. CheerBots can be retrieval-based or generative-based and were fine-tuned by deep reinforcement learning.
To respond in an empathetic way, we develop a simulated agent, the Conceptual Human Model, that aids CheerBots during training by considering future changes in the user's emotional state to arouse sympathy.
arXiv Detail & Related papers (2021-10-08T07:44:47Z) - Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of a human.
We examine our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrate its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.