Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective
- URL: http://arxiv.org/abs/2501.09493v2
- Date: Tue, 18 Feb 2025 10:46:28 GMT
- Title: Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective
- Authors: Nuo Chen, Quanyu Dai, Xiaoyu Dong, Xiao-Ming Wu, Zhenhua Dong
- Abstract summary: This study proposes an automated LLM-based CRS evaluation framework. It builds upon existing research in human-computer interaction and psychology. We use this framework to evaluate four different conversational recommender systems.
- Score: 38.940283784200005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational recommender systems (CRS) involve both recommendation and dialogue tasks, which makes their evaluation a unique challenge. Although past research has analyzed various factors that may affect user satisfaction with CRS interactions from the perspective of user studies, few evaluation metrics for CRS have been proposed. Recent studies have shown that LLMs can align with human preferences, and several LLM-based text quality evaluation measures have been introduced. However, the application of LLMs in CRS evaluation remains relatively limited. To address this research gap and advance the development of user-centric conversational recommender systems, this study proposes an automated LLM-based CRS evaluation framework, building upon existing research in human-computer interaction and psychology. The framework evaluates CRS from four dimensions: dialogue behavior, language expression, recommendation items, and response content. We use this framework to evaluate four different conversational recommender systems.
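To make the four evaluation dimensions concrete, the following is a minimal Python sketch of an LLM-as-judge pass over a CRS dialogue. The prompt wording, the 1-5 rating scale, and the injected `call_llm` callable are illustrative assumptions, not the paper's actual prompts or protocol.

```python
# Sketch: score a CRS dialogue on the four dimensions named in the abstract
# (dialogue behavior, language expression, recommendation items, response content).
# The prompt, scale, and `call_llm` hook are assumptions for illustration only.
import json
from typing import Callable, Dict, List

DIMENSIONS = [
    "dialogue behavior",
    "language expression",
    "recommendation items",
    "response content",
]

def build_prompt(dialogue: List[Dict[str, str]]) -> str:
    """Format the CRS dialogue and ask the judge LLM for per-dimension scores."""
    transcript = "\n".join(f"{turn['role']}: {turn['text']}" for turn in dialogue)
    criteria = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "You are evaluating a conversational recommender system from the user's perspective.\n"
        f"Dialogue:\n{transcript}\n\n"
        "Rate the system on each dimension from 1 (poor) to 5 (excellent) and reply "
        "with a JSON object whose keys are the dimension names:\n"
        f"{criteria}\n"
    )

def evaluate_dialogue(
    dialogue: List[Dict[str, str]],
    call_llm: Callable[[str], str],
) -> Dict[str, float]:
    """Run one evaluation pass; `call_llm` wraps whatever LLM API is available."""
    raw = call_llm(build_prompt(dialogue))
    scores = json.loads(raw)  # assumes the judge returns valid JSON
    return {dim: float(scores[dim]) for dim in DIMENSIONS}

if __name__ == "__main__":
    # Stub judge so the sketch runs without an API key; swap in a real client.
    def fake_llm(prompt: str) -> str:
        return json.dumps({d: 4 for d in DIMENSIONS})

    demo = [
        {"role": "user", "text": "I want a light sci-fi movie for tonight."},
        {"role": "system", "text": "How about The Martian? It is upbeat and well paced."},
    ]
    print(evaluate_dialogue(demo, fake_llm))
```

In practice, scores from several such judge passes would be aggregated per dimension and per system to compare the four CRSs; the aggregation scheme here is likewise an assumption.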
Related papers
- Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models [70.180385882195]
This paper introduces a personality-aware user simulation for Conversational Recommender Systems (CRSs).
The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs.
Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits.
arXiv Detail & Related papers (2025-04-09T13:21:17Z) - Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings.
We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z) - Graph Retrieval-Augmented LLM for Conversational Recommendation Systems [52.35491420330534]
G-CRS (Graph Retrieval-Augmented Large Language Model for Conversational Recommender Systems) is a training-free framework that combines graph retrieval-augmented generation and in-context learning.
G-CRS achieves superior recommendation performance compared to existing methods without requiring task-specific training.
arXiv Detail & Related papers (2025-03-09T03:56:22Z) - Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommender Systems [1.652907918484303]
Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS).
LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.
This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction.
arXiv Detail & Related papers (2024-04-17T21:56:27Z) - Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors [68.68418801681965]
We propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors.
Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons in current CRS models.
Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS.
arXiv Detail & Related papers (2024-04-04T08:56:48Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z) - INFACT: An Online Human Evaluation Framework for Conversational Recommendation [5.837881923712394]
Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations.
Current research on machine learning-based CRS models acknowledges the importance of humans in the evaluation process.
arXiv Detail & Related papers (2022-09-07T15:16:59Z)