Assessing Task-based Chatbots: Snapshot and Curated Datasets for Dialogflow
- URL: http://arxiv.org/abs/2601.19787v1
- Date: Tue, 27 Jan 2026 16:49:56 GMT
- Title: Assessing Task-based Chatbots: Snapshot and Curated Datasets for Dialogflow
- Authors: Elena Masserini, Diego Clerissi, Daniela Micucci, Leonardo Mariani
- Abstract summary: This paper presents TOFU-D, a snapshot of 1,788 Dialogflow chatbots from GitHub, and COD, a curated subset of TOFU-D including 185 validated chatbots. A preliminary assessment using the Botium testing framework and the Bandit static analyzer revealed gaps in test coverage and frequent security vulnerabilities in several chatbots.
- Score: 5.64612424709862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, chatbots have gained widespread adoption thanks to their ability to assist users at any time and across diverse domains. However, the lack of large-scale curated datasets limits research on their quality and reliability. This paper presents TOFU-D, a snapshot of 1,788 Dialogflow chatbots from GitHub, and COD, a curated subset of TOFU-D including 185 validated chatbots. The two datasets capture a wide range of domains, languages, and implementation patterns, offering a sound basis for empirical studies on chatbot quality and security. A preliminary assessment using the Botium testing framework and the Bandit static analyzer revealed gaps in test coverage and frequent security vulnerabilities in several chatbots, highlighting the need for systematic, multi-platform research on chatbot quality and security.
Related papers
- Automated Testing of Task-based Chatbots: How Far Are We? [5.64612424709862]
Task-based chatbots are software, typically embedded in real-world applications, that assist users in completing tasks through a conversational interface. In this paper, we evaluate the effectiveness of state-of-the-art testing techniques on a curated selection of task-based chatbots from GitHub.
arXiv Detail & Related papers (2026-02-13T16:32:50Z)
- Towards Multi-Platform Mutation Testing of Task-based Chatbots [5.64612424709862]
We present our extension of MUTABOT to multiple platforms (Dialogflow and Rasa). MUTABOT is a mutation testing approach for injecting faults in conversations. We show how mutation testing can be used to reveal weaknesses in test suites generated by the Botium state-of-the-art test generator.
arXiv Detail & Related papers (2025-09-01T11:36:06Z)
- Towards the Assessment of Task-based Chatbots: From the TOFU-R Snapshot to the BRASATO Curated Dataset [4.236238836715225]
In this paper, we present two datasets and the tool support necessary to create and maintain these datasets. The first dataset is RASA TASK-BASED CHATBOTS FROM GITHUB (TOFU-R), which is a snapshot of the Rasa chatbots available on GitHub. The second dataset is BOT RASA COLLECTION (BRASATO), a curated selection of the most relevant chatbots for dialogue complexity, functional complexity, and utility.
arXiv Detail & Related papers (2025-08-21T12:24:05Z)
- SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness [4.896226014796392]
We introduce SafeChat, a general architecture for building safe and trustworthy chatbots. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance); (b) usability, with automatic extractive summarization of long responses, traceable to their sources; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices.
arXiv Detail & Related papers (2025-04-08T19:16:43Z)
- Seq2Seq Model-Based Chatbot with LSTM and Attention Mechanism for Enhanced User Interaction [1.937324318931008]
This work proposes a Sequence-to-Sequence (Seq2Seq) model with an encoder-decoder architecture that incorporates attention mechanisms and Long Short-Term Memory (LSTM) cells. The proposed Seq2Seq model-based robot is trained, validated, and tested on a dataset specifically for the tourism sector in Draa-Tafilalet, Morocco.
arXiv Detail & Related papers (2024-12-27T23:50:54Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [70.08842857515141]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD. Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
- Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of humans.
We examined our framework using three experimental setups and evaluated the guiding robot with four different metrics to demonstrate its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
- Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we carefully normalize it via processes such as anonymization.
The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)