INFACT: An Online Human Evaluation Framework for Conversational
Recommendation
- URL: http://arxiv.org/abs/2209.03213v1
- Date: Wed, 7 Sep 2022 15:16:59 GMT
- Title: INFACT: An Online Human Evaluation Framework for Conversational
Recommendation
- Authors: Ahtsham Manzoor, Dietmar Jannach
- Abstract summary: Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations.
Current research on machine learning-based CRS models acknowledges the importance of humans in the evaluation process.
- Score: 5.837881923712394
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Conversational recommender systems (CRS) are interactive agents that support
their users in recommendation-related goals through multi-turn conversations.
Generally, a CRS can be evaluated along various dimensions. Today's CRSs mainly
rely on offline (computational) measures to assess the performance of their
algorithms in comparison to different baselines. However, offline measures can
have limitations, for example, when the metrics that compare a newly generated
response with the ground truth do not correlate with human perceptions, because
various alternative responses might also be suitable in a given dialog
situation. Current research on machine learning-based CRS models therefore
acknowledges the importance of humans in the evaluation process, knowing that
pure offline measures may not be sufficient in evaluating a highly interactive
system like a CRS.
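To make the limitation of offline measures concrete, here is a minimal illustrative sketch (not taken from the paper; the movie-recommendation responses and the choice of BLEU are assumptions) showing how a reference-based metric can assign a low score to a generated response that a human would judge perfectly acceptable:

```python
# Minimal sketch: reference-based offline evaluation of a CRS response.
# A sensible alternative recommendation scores poorly simply because it
# shares few n-grams with the single ground-truth response.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ground_truth = "you would enjoy The Matrix it is a classic sci-fi movie".split()
generated = "how about Inception it is a great science fiction film".split()

smooth = SmoothingFunction().method1  # avoid zero scores when higher-order n-grams do not overlap
score = sentence_bleu([ground_truth], generated, smoothing_function=smooth)
print(f"BLEU vs. the single reference: {score:.3f}")  # low, despite being a reasonable recommendation
```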
Related papers
- Stop Playing the Guessing Game! Target-free User Simulation for Evaluating Conversational Recommender Systems [15.481944998961847]
PEPPER is an evaluation protocol with target-free user simulators constructed from real-user interaction histories and reviews.
PEPPER enables realistic user-CRS dialogues without falling into simplistic guessing games.
PEPPER presents detailed measures for comprehensively evaluating the preference elicitation capabilities of CRSs.
arXiv Detail & Related papers (2024-11-25T07:36:20Z)
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z)
- A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems [54.78815548652424]
Conversational recommender systems generate recommendations through an interactive process.
Not all CRS approaches use human conversations as their source of interaction data.
Holistic CRS are trained using conversational data collected from real-world scenarios.
arXiv Detail & Related papers (2023-09-14T12:55:23Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
- Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is infeasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z)
- DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations [46.942369532632604]
We propose a Dialogue Evaluation metric that relies on AMR-based semantic manipulations for incoherent data generation.
Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods.
arXiv Detail & Related papers (2022-03-18T03:11:35Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
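A recurring evaluation pattern in the related papers above (e.g., PairEval, C-PMI, DEAM, ENIGMA) is reporting how strongly an automatic metric correlates with human judgments. The following is a purely illustrative sketch with made-up scores (not data from any listed paper) of how such a metric-human correlation is commonly computed:

```python
# Illustrative only: hypothetical per-response metric scores and human ratings.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.12, 0.45, 0.33, 0.80, 0.56]  # automatic metric output (hypothetical)
human_ratings = [2, 4, 3, 5, 4]                 # human quality judgments (hypothetical)

rho, p_rho = spearmanr(metric_scores, human_ratings)  # rank correlation
r, p_r = pearsonr(metric_scores, human_ratings)       # linear correlation
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f}); Pearson r = {r:.3f} (p = {p_r:.3f})")
```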
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.