Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs
- URL: http://arxiv.org/abs/2404.12994v2
- Date: Mon, 29 Apr 2024 19:00:34 GMT
- Title: Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs
- Authors: Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke
- Abstract summary: In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversational setting has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
- Score: 57.16442740983528
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and the evaluation instead often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational setting has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate a distinct difference in the ratings assigned by both annotator groups between the two setups, indicating that user feedback does influence system evaluation. Crowdworkers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more susceptible on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.
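To make the paper's two assessment setups concrete, here is a minimal Python sketch of the annotation step. This is not the authors' released code: the four aspects come from the abstract, but the prompt wording, the 1-5 scale, and the `llm` callable are assumptions.

```python
import json

# The four evaluation aspects named in the abstract.
ASPECTS = ["relevance", "usefulness", "interestingness", "explanation quality"]

def build_prompt(context: str, system_response: str, followup: str | None = None) -> str:
    """Assemble an annotation prompt; followup=None gives the feedback-free setup."""
    lines = [
        "Rate the system's response on a 1-5 scale for each aspect: "
        + ", ".join(ASPECTS) + ".",
        f"Dialogue context: {context}",
        f"System response: {system_response}",
    ]
    if followup is not None:
        # Feedback-inclusive setup: expose the user's follow-up utterance.
        lines.append(f"User's follow-up utterance: {followup}")
    lines.append('Reply with JSON only, e.g. {"relevance": 4, ...}.')
    return "\n".join(lines)

def annotate(llm, context: str, system_response: str, followup: str | None = None) -> dict:
    """llm is any callable str -> str (an LLM API wrapper; hypothetical here)."""
    return json.loads(llm(build_prompt(context, system_response, followup)))
```

Running `annotate` over the same turns once with and once without the follow-up utterance, then comparing per-aspect ratings across the two conditions, mirrors the paper's core contrast.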
Related papers
- CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z)
- Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems [14.98159964397052]
We analyzed what features an automatic response evaluator needs from the interlocutor's perspective.
The first experiment on the Hazumi dataset revealed that interlocutor awareness plays a critical role in making automatic response evaluation correlate with the interlocutor's judgments.
The second experiment using massive conversations on X (formerly Twitter) confirmed that dialogue continuity prediction can train an interlocutor-aware response evaluator without human feedback.
arXiv Detail & Related papers (2024-01-04T13:15:41Z)
- Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation [13.651502777079237]
This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups.
Our analysis reveals that Likert evaluations are robust to the choice of evaluator group, with only minor differences observed when changing groups, whereas Pairwise evaluations are not.
arXiv Detail & Related papers (2023-09-14T19:19:50Z)
- Continually Improving Extractive QA via Human Feedback [59.49549491725224]
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z)
- Understanding How People Rate Their Conversations [73.17730062864314]
We conduct a study to better understand how people rate their interactions with conversational agents.
We focus on agreeableness and extraversion as variables that may explain variation in ratings.
arXiv Detail & Related papers (2022-06-01T00:45:32Z)
- SIFN: A Sentiment-aware Interactive Fusion Network for Review-based Item Recommendation [48.1799451277808]
We propose a Sentiment-aware Interactive Fusion Network (SIFN) for review-based item recommendation.
We first encode user/item reviews via BERT and propose a lightweight sentiment learner to extract semantic features of each review.
Then, we propose a sentiment prediction task that guides the sentiment learner to extract sentiment-aware features via explicit sentiment labels.
arXiv Detail & Related papers (2021-08-18T08:04:38Z)
- User and Item-aware Estimation of Review Helpfulness [4.640835690336653]
We investigate the role of deviations in the properties of reviews as helpfulness determinants.
We propose a novel helpfulness estimation model that extends previous ones.
Our model is thus an effective tool to select relevant user feedback for decision-making.
arXiv Detail & Related papers (2020-11-20T15:35:56Z)
- Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning [69.42679922160684]
We propose feedback-weighted learning, based on importance sampling, to improve an initial supervised system using binary user feedback (see the sketch after this list).
Our work opens the prospect of exploiting interactions with real users to improve conversational systems after deployment.
arXiv Detail & Related papers (2020-11-01T19:50:34Z)
- Soliciting Human-in-the-Loop User Feedback for Interactive Machine Learning Reduces User Trust and Impressions of Model Accuracy [8.11839312231511]
Mixed-initiative systems allow users to interactively provide feedback to improve system performance.
Our research investigates how the act of providing feedback can affect user understanding of an intelligent system and its accuracy.
arXiv Detail & Related papers (2020-08-28T16:46:41Z)
- Enabling the Analysis of Personality Aspects in Recommender Systems [0.0]
Existing recommender systems mainly focus on exploiting users' feedback, e.g., ratings and reviews of common items, to detect similar users.
We call this problem Data Sparsity With no Feedback on Common Items (DSW-n-FCI).
We identify users' personality type implicitly, with no burden on users, and incorporate it along with users' personal interests and level of knowledge.
arXiv Detail & Related papers (2020-01-07T23:02:07Z)
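As referenced in the Feedback-Weighted Learning entry above, here is a minimal PyTorch sketch of the importance-sampling idea: each logged answer's negative log-likelihood is reweighted by the binary feedback divided by the probability the deployed system assigned to that answer. This is not the paper's implementation; the function name, tensor layout, and 0/1 reward encoding are assumptions.

```python
import torch

def feedback_weighted_loss(logits: torch.Tensor,
                           logged_answers: torch.Tensor,
                           feedback: torch.Tensor,
                           behavior_probs: torch.Tensor) -> torch.Tensor:
    """Importance-sampling-weighted NLL over logged interactions.

    logits:         (batch, n_candidates) scores from the model being improved
    logged_answers: (batch,) index of the answer the deployed system gave
    feedback:       (batch,) binary user feedback, 1 = accepted, 0 = rejected
    behavior_probs: (batch,) probability the deployed system gave its answer
    """
    log_p = torch.log_softmax(logits, dim=-1)
    # NLL of the answer that was actually shown to the user.
    nll = -log_p.gather(-1, logged_answers.unsqueeze(-1)).squeeze(-1)
    # Importance-sampling correction: reward / behavior probability.
    weights = feedback.float() / behavior_probs.clamp_min(1e-6)
    return (weights * nll).mean()

# Toy usage: 4 logged turns, 3 candidate answers each.
logits = torch.randn(4, 3, requires_grad=True)
loss = feedback_weighted_loss(
    logits,
    logged_answers=torch.tensor([0, 2, 1, 0]),
    feedback=torch.tensor([1, 0, 1, 1]),
    behavior_probs=torch.tensor([0.6, 0.3, 0.5, 0.8]),
)
loss.backward()  # gradients flow only through positively-rated answers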
This list is automatically generated from the titles and abstracts of the papers on this site.