Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective
- URL: http://arxiv.org/abs/2501.09493v2
- Date: Tue, 18 Feb 2025 10:46:28 GMT
- Title: Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective
- Authors: Nuo Chen, Quanyu Dai, Xiaoyu Dong, Xiao-Ming Wu, Zhenhua Dong
- Abstract summary: This study proposes an automated LLM-based CRS evaluation framework.
It builds upon existing research in human-computer interaction and psychology.
We use this framework to evaluate four different conversational recommender systems.
- Score: 38.94
- License:
- Abstract: Conversational recommender systems (CRS) involve both recommendation and dialogue tasks, which makes their evaluation a unique challenge. Although past research has analyzed various factors that may affect user satisfaction with CRS interactions from the perspective of user studies, few evaluation metrics for CRS have been proposed. Recent studies have shown that LLMs can align with human preferences, and several LLM-based text quality evaluation measures have been introduced. However, the application of LLMs in CRS evaluation remains relatively limited. To address this research gap and advance the development of user-centric conversational recommender systems, this study proposes an automated LLM-based CRS evaluation framework, building upon existing research in human-computer interaction and psychology. The framework evaluates CRS from four dimensions: dialogue behavior, language expression, recommendation items, and response content. We use this framework to evaluate four different conversational recommender systems.
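The abstract describes the framework only at a high level. As an illustrative sketch, the snippet below shows one way an LLM-as-evaluator could score a single CRS dialogue along the four stated dimensions; the prompt wording, the `gpt-4o-mini` model choice, the `evaluate_dialogue` helper, and the OpenAI client usage are all assumptions made for illustration, not the authors' implementation.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client is installed

# The four evaluation dimensions named in the abstract.
DIMENSIONS = [
    "dialogue behavior",
    "language expression",
    "recommendation items",
    "response content",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def evaluate_dialogue(dialogue: str, model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM judge to rate one CRS dialogue on each dimension (1-5)."""
    prompt = (
        "You evaluate a conversational recommender system from the user's "
        "perspective. Rate the dialogue below on each of these dimensions "
        f"from 1 (poor) to 5 (excellent): {', '.join(DIMENSIONS)}. "
        "Return a JSON object mapping each dimension to an integer score.\n\n"
        f"Dialogue:\n{dialogue}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Example usage (hypothetical file name):
# scores = evaluate_dialogue(open("crs_session.txt").read())
# print(scores)  # e.g. {"dialogue behavior": 4, ...}
```

A full evaluation would repeat this per-dialogue scoring over many sessions for each of the four compared systems and aggregate the per-dimension scores; the sketch covers only the single-dialogue judging step.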
Related papers
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommender Systems [1.652907918484303]
Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS).
LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.
This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction.
arXiv Detail & Related papers (2024-04-17T21:56:27Z) - Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors [68.68418801681965]
We propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors.
Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons of current CRS models.
Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS.
arXiv Detail & Related papers (2024-04-04T08:56:48Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we investigate the use of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z) - INFACT: An Online Human Evaluation Framework for Conversational Recommendation [5.837881923712394]
Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations.
Current research on machine learning-based CRS models acknowledges the importance of humans in the evaluation process.
arXiv Detail & Related papers (2022-09-07T15:16:59Z) - MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue [15.31433922183745]
We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 Track 5 Subtask 1 Automatic Open-domain Dialogue Evaluation Challenge.
arXiv Detail & Related papers (2022-06-19T13:43:59Z) - Deep Conversational Recommender Systems: A New Frontier for Goal-Oriented Dialogue Systems [54.06971074217952]
A conversational recommender system (CRS) learns and models users' preferences through interactive dialogue.
Deep learning approaches have been applied to CRS and have produced fruitful results.
arXiv Detail & Related papers (2020-04-28T02:20:42Z)