CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems
- URL: http://arxiv.org/abs/2403.19056v2
- Date: Tue, 20 Aug 2024 10:56:18 GMT
- Title: CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems
- Authors: Amin Abolghasemi, Zhaochun Ren, Arian Askari, Mohammad Aliannejadi, Maarten de Rijke, Suzan Verberne
- Abstract summary: We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
- Score: 60.27663010453209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important unexplored aspect in previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is their evaluation in terms of robustness for the identification of user dissatisfaction: current benchmarks for user satisfaction estimation in TOD systems are highly skewed towards dialogues for which the user is satisfied. The effect of having a more balanced set of satisfaction labels on performance is unknown. However, balancing the data with more dissatisfactory dialogue samples requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage large language models (LLMs) and unlock their ability to generate satisfaction-aware counterfactual dialogues to augment the set of original dialogues of a test collection. We gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that when used as few-shot user satisfaction estimators, open-source LLMs show higher robustness to the increase in the number of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, which are curated by human annotation, to facilitate further research on this topic.
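The augmentation idea described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the prompt template, the `flip_label` helper, and the toy collection are assumptions for exposition, not the authors' actual prompts or pipeline, and the LLM call itself is elided (only the prompt is built).

```python
# Hypothetical sketch of satisfaction-aware counterfactual augmentation.
# PROMPT_TEMPLATE and flip_label are illustrative assumptions, not the
# paper's actual prompts; an LLM would complete the generated prompt.
from collections import Counter

PROMPT_TEMPLATE = (
    "Below is a task-oriented dialogue in which the user is {label}.\n"
    "Rewrite the user turns so that the user is {flipped} with the system,\n"
    "while keeping the task and the system behaviour unchanged.\n\n{dialogue}"
)

def flip_label(label: str) -> str:
    """Flip a binary satisfaction label."""
    return "dissatisfied" if label == "satisfied" else "satisfied"

def counterfactual_prompt(dialogue: str, label: str) -> str:
    """Build the generation prompt for one dialogue (to be sent to an LLM)."""
    return PROMPT_TEMPLATE.format(
        label=label, flipped=flip_label(label), dialogue=dialogue
    )

def label_distribution(samples):
    """Fraction of each satisfaction label in a collection of (id, label) pairs."""
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# A skewed toy collection: 3 satisfied dialogues, 1 dissatisfied.
original = [("d1", "satisfied"), ("d2", "satisfied"),
            ("d3", "satisfied"), ("d4", "dissatisfied")]
# Counterfactuals flip the label of each satisfied dialogue, rebalancing the set.
counterfactuals = [(d + "_cf", flip_label(lbl))
                   for d, lbl in original if lbl == "satisfied"]
augmented = original + counterfactuals
```

On the toy collection, augmentation shifts the label distribution from 75% satisfied toward a much more balanced split, which is the property the paper's robustness evaluation exploits.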
Related papers
- Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems [6.8738526619759535]
Offline datasets have been used to evaluate task-oriented dialogue (TOD) models.
User-agents, which are context-aware, can simulate the variability and unpredictability of human conversations.
arXiv Detail & Related papers (2024-11-15T06:05:45Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of conversational turns has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models [35.95405294377247]
Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns.
We show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches.
arXiv Detail & Related papers (2024-03-19T02:57:07Z)
- Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals [76.69419538047813]
The ProToD approach anticipates future dialogue actions and incorporates a goal-oriented reward signal to enhance ToD systems.
We present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations.
Empirical experiments conducted on the MultiWOZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data.
arXiv Detail & Related papers (2023-09-16T10:56:00Z)
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO), which combines an LLM with a smaller task-oriented dialogue model.
This approach uses the LLM as an annotation-free user simulator to assess dialogue responses, combining it with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
- Modeling User Satisfaction Dynamics in Dialogue via Hawkes Process [17.477718698071424]
We propose ASAP, a new estimator that treats user satisfaction across turns as an event sequence and employs a Hawkes process to model the dynamics of this sequence.
Experimental results on four benchmark dialogue datasets demonstrate that ASAP can substantially outperform state-of-the-art baseline estimators.
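The core Hawkes-process idea can be illustrated with a minimal sketch: turns where the user expressed dissatisfaction are treated as past events whose influence on the current intensity decays exponentially. The parameter values and the event encoding below are arbitrary assumptions for illustration; ASAP's actual model is learned, not hand-set.

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.8, beta=1.0):
    """Conditional intensity of a univariate Hawkes process with an
    exponential kernel: lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
    mu is the base rate; each past event at t_i adds a decaying excitation."""
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)

# Turns at which the user expressed dissatisfaction, indexed as "times".
dissatisfaction_turns = [2.0, 3.0]
```

The intensity is highest just after a dissatisfaction event and relaxes back toward the base rate `mu` as turns pass, which is what lets such a model capture how recent dissatisfaction raises the chance of further dissatisfaction.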
arXiv Detail & Related papers (2023-05-21T23:04:14Z)
- Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator [37.590563896382456]
We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems.
We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.
Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates.
arXiv Detail & Related papers (2022-10-26T07:41:32Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.