FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
- URL: http://arxiv.org/abs/2506.00314v1
- Date: Fri, 30 May 2025 23:54:13 GMT
- Title: FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
- Authors: Hideaki Joko, Faegheh Hasibi
- Abstract summary: This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method. It provides evaluation scores for diverse turn- and dialogue-level qualities of recommendation conversations. FACE is reference-free and shows strong correlation with human judgments.
- Score: 4.028503203417233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A systematic, reliable, and low-cost evaluation of Conversational Recommender Systems (CRSs) remains an open challenge. Existing automatic CRS evaluation methods have proven insufficient for capturing the dynamic nature of recommendation conversations. This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method that provides evaluation scores for diverse turn- and dialogue-level qualities of recommendation conversations. FACE is reference-free and shows strong correlation with human judgments, achieving a system-level correlation of 0.9 and turn/dialogue-level correlations of 0.5, outperforming state-of-the-art CRS evaluation methods by a large margin. Additionally, unlike existing LLM-based methods that provide single uninterpretable scores, FACE offers insight into system performance and makes it possible to identify and locate problems within conversations.
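To make the reported numbers concrete, the following is a minimal sketch of how a reference-free evaluator like FACE is typically validated against human judgments: per-system mean scores are correlated for the system-level number, and individual dialogue scores for the dialogue-level number. The score arrays here are hypothetical; this is not the authors' code.

```python
# Minimal sketch of meta-evaluating an automatic CRS evaluator against
# human judgments. Hypothetical data; not the FACE implementation.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-dialogue scores for four systems, from the evaluator
# and from human annotators.
evaluator = {"sysA": [3.1, 3.4, 2.9], "sysB": [4.0, 4.2, 3.8],
             "sysC": [2.0, 2.4, 2.2], "sysD": [3.6, 3.3, 3.5]}
human =     {"sysA": [3.0, 3.5, 3.0], "sysB": [4.5, 4.0, 4.0],
             "sysC": [2.5, 2.0, 2.0], "sysD": [3.5, 3.5, 3.0]}

# System-level correlation: correlate the per-system mean scores.
sys_eval = [np.mean(v) for v in evaluator.values()]
sys_human = [np.mean(human[k]) for k in evaluator]
print("system-level Pearson:", pearsonr(sys_eval, sys_human)[0])

# Dialogue-level correlation: correlate individual dialogue scores.
flat_eval = [s for v in evaluator.values() for s in v]
flat_human = [s for k in evaluator for s in human[k]]
print("dialogue-level Spearman:", spearmanr(flat_eval, flat_human)[0])
```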
Related papers
- Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges [22.7340872046127]
We propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment.
arXiv Detail & Related papers (2025-08-01T09:26:01Z)
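The "collective wisdom" idea can be pictured as distilling several judges' pairwise preferences into one training signal. Below is a hypothetical sketch using simple majority voting; the paper's aggregation is likely more elaborate, and all names here are illustrative.

```python
# Hypothetical sketch: aggregate pairwise preferences from several LLM
# judges into a single label a smaller evaluator can be distilled on.
from collections import Counter

def aggregate_preference(judge_votes: list[str]) -> str:
    """Majority vote over per-judge preferences ('A', 'B', or 'tie')."""
    winner, _ = Counter(judge_votes).most_common(1)[0]
    return winner

# Three hypothetical judges compare two candidate responses to one context.
label = aggregate_preference(["A", "A", "B"])
print(label)  # 'A' becomes the distillation target for the single model
```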
- Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective [38.940283784200005]
This study proposes an automated LLM-based CRS evaluation framework. It builds upon existing research in human-computer interaction and psychology. We use this framework to evaluate four different conversational recommender systems.
arXiv Detail & Related papers (2025-01-16T12:06:56Z)
- Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS.
We formulate RRS from a causal perspective, modeling recommendations as bilateral interventions.
We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
arXiv Detail & Related papers (2024-08-19T07:21:02Z)
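The bilateral-intervention view suggests a natural reranking heuristic: a reciprocal match needs acceptance from both sides, so candidates can be ordered by the product of the two directions' acceptance probabilities. The sketch below illustrates that heuristic with stubbed probabilities; it is not the paper's exact method.

```python
# Hedged sketch of bilateral reranking for a reciprocal recommender:
# score each candidate v for user u by p(u accepts v) * p(v accepts u).
def rerank(candidates, p_forward, p_backward):
    """Sort candidates by expected two-sided match probability."""
    scored = [(v, p_forward[v] * p_backward[v]) for v in candidates]
    return [v for v, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

candidates = ["v1", "v2", "v3"]
p_forward = {"v1": 0.9, "v2": 0.6, "v3": 0.4}   # u's interest in v (stub)
p_backward = {"v1": 0.2, "v2": 0.7, "v3": 0.8}  # v's interest in u (stub)
print(rerank(candidates, p_forward, p_backward))  # ['v2', 'v3', 'v1']
```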
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems [54.78815548652424]
Conversational recommender systems generate recommendations through an interactive process.
Not all CRS approaches use human conversations as their source of interaction data.
Holistic CRSs are trained using conversational data collected from real-world scenarios.
arXiv Detail & Related papers (2023-09-14T12:55:23Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
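One way to read the C-PMI idea: score a system turn by how much it raises the likelihood of the observed user reply, relative to the dialogue context alone. A minimal sketch, assuming log-probabilities from any language-model scorer; the values are hypothetical and this is not the paper's exact estimator.

```python
# Sketch of conditional PMI for a turn pair:
# C-PMI ~ log p(u_t | c, s_t) - log p(u_t | c), i.e., how much the system
# turn s_t raises the likelihood of the observed user reply u_t.
import math

def conditional_pmi(logp_reply_given_ctx_and_sys: float,
                    logp_reply_given_ctx: float) -> float:
    return logp_reply_given_ctx_and_sys - logp_reply_given_ctx

# Hypothetical log-probabilities from a language-model scorer:
print(conditional_pmi(math.log(0.02), math.log(0.005)))  # ~1.39 nats
```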
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
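The simulator-based protocol can be pictured as a loop: an LLM plays a user pursuing a hidden target item, the CRS recommends, and success is whether the target is hit within a turn budget. A minimal sketch with stubbed agents follows; the crs_reply and user_simulator callables are hypothetical, and this is not the iEvaLM implementation.

```python
# Hedged sketch of interactive evaluation with a simulated user.
def simulate_dialogue(crs_reply, user_simulator, target_item, max_turns=5):
    history = []
    utterance = user_simulator(history, target_item)  # opening request
    for _ in range(max_turns):
        history.append(("user", utterance))
        recommendation = crs_reply(history)
        history.append(("system", recommendation))
        if recommendation == target_item:
            return True, history  # recommendation success within budget
        utterance = user_simulator(history, target_item)  # simulated feedback
    return False, history

# Stub agents for illustration only.
items = ["item_a", "item_b", "item_c"]
crs_reply = lambda h: items[sum(1 for r, _ in h if r == "system") % len(items)]
user_simulator = lambda h, target: f"I want something like {target}"
print(simulate_dialogue(crs_reply, user_simulator, "item_b")[0])  # True
```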
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and is compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- INFACT: An Online Human Evaluation Framework for Conversational Recommendation [5.837881923712394]
Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations.
Current research on machine learning-based CRS models acknowledges the importance of humans in the evaluation process.
arXiv Detail & Related papers (2022-09-07T15:16:59Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
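The off-policy idea can be illustrated with the simplest estimator in that family, self-normalized importance sampling over logged conversations. ENIGMA itself is model-free and more sophisticated than plain importance weighting, so treat this only as a sketch with hypothetical numbers.

```python
# Hedged sketch of off-policy evaluation of a dialogue policy: estimate the
# target policy's expected human score from conversations logged under a
# behavior policy, via self-normalized importance sampling.
def ips_estimate(logged):
    """logged: list of (target_prob, behavior_prob, human_score) episodes."""
    weights = [tp / bp for tp, bp, _ in logged]
    total = sum(weights)
    return sum(w * r for w, (_, _, r) in zip(weights, logged)) / total

# Hypothetical logged episodes: (p_target, p_behavior, human score).
logged = [(0.30, 0.25, 4.0), (0.10, 0.20, 2.5), (0.05, 0.10, 3.0)]
print(round(ips_estimate(logged), 3))  # ~3.432
```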