User Simulation with Large Language Models for Evaluating Task-Oriented
Dialogue
- URL: http://arxiv.org/abs/2309.13233v1
- Date: Sat, 23 Sep 2023 02:04:57 GMT
- Title: User Simulation with Large Language Models for Evaluating Task-Oriented
Dialogue
- Authors: Sam Davidson, Salvatore Romeo, Raphael Shu, James Gung, Arshit Gupta,
Saab Mansour, Yi Zhang
- Abstract summary: We propose a novel user simulator built using recently developed large pretrained language models (LLMs)
Unlike previous work, which sought to maximize goal success rate (GSR) as the primary metric of simulator performance, our goal is a system which achieves a GSR similar to that observed in human interactions with TOD systems.
- Score: 10.336443286833145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the major impediments to the development of new task-oriented dialogue
(TOD) systems is the need for human evaluation at multiple stages and
iterations of the development process. In an effort to move toward automated
evaluation of TOD, we propose a novel user simulator built using recently
developed large pretrained language models (LLMs). In order to increase the
linguistic diversity of our system relative to the related previous work, we do
not fine-tune the LLMs used by our system on existing TOD datasets; rather we
use in-context learning to prompt the LLMs to generate robust and
linguistically diverse output with the goal of simulating the behavior of human
interlocutors. Unlike previous work, which sought to maximize goal success rate
(GSR) as the primary metric of simulator performance, our goal is a system
which achieves a GSR similar to that observed in human interactions with TOD
systems. Using this approach, our current simulator is effectively able to
interact with several TOD systems, especially on single-intent conversational
goals, while generating lexically and syntactically diverse output relative to
previous simulators that rely upon fine-tuned models. Finally, we collect a
Human2Bot dataset of humans interacting with the same TOD systems with which we
experimented in order to better quantify these achievements.
Related papers
- Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems [6.8738526619759535]
offline datasets have been used to evaluate task-oriented dialogue (TOD) models.
User-agents, which are context-aware, can simulate the variability and unpredictability of human conversations.
arXiv Detail & Related papers (2024-11-15T06:05:45Z) - Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models [16.94819621353007]
SynTOD is a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) systems.
It generates diverse, structured conversations through random walks and response simulation using large language models.
In our experiments, using graph-guided response simulations leads to significant improvements in intent classification, slot filling and response relevance.
arXiv Detail & Related papers (2024-04-23T06:23:34Z) - Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems [2.788542465279969]
This paper introduces DAUS, a Domain-Aware User Simulator.
We fine-tune DAUS on real examples of task-oriented dialogues.
Results on two relevant benchmarks showcase significant improvements in terms of user goal fulfillment.
arXiv Detail & Related papers (2024-02-20T20:57:47Z) - SpeechAgents: Human-Communication Simulation with Multi-Modal
Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society.
We propose SpeechAgents, a multi-modal LLM based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z) - DIALIGHT: Lightweight Multilingual Development and Evaluation of
Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Enhancing Large Language Model Induced Task-Oriented Dialogue Systems
Through Look-Forward Motivated Goals [76.69419538047813]
ProToD approach anticipates the future dialogue actions and incorporates the goal-oriented reward signal to enhance ToD systems.
We present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations.
Empirical experiments conducted on the MultiWoZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data.
arXiv Detail & Related papers (2023-09-16T10:56:00Z) - Unlocking the Potential of User Feedback: Leveraging Large Language
Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller task-oriented dialogue model.
This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z) - Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with
User Simulator [37.590563896382456]
We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems.
We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.
Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates.
arXiv Detail & Related papers (2022-10-26T07:41:32Z) - SOLOIST: Building Task Bots at Scale with Transfer Learning and Machine
Teaching [81.45928589522032]
We parameterize modular task-oriented dialog systems using a Transformer-based auto-regressive language model.
We pre-train, on heterogeneous dialog corpora, a task-grounded response generation model.
Experiments show that SOLOIST creates new state-of-the-art on well-studied task-oriented dialog benchmarks.
arXiv Detail & Related papers (2020-05-11T17:58:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.