Evaluating Conversational Recommender Systems via Large Language Models: A User-Centric Framework
- URL: http://arxiv.org/abs/2501.09493v3
- Date: Mon, 21 Jul 2025 10:23:44 GMT
- Title: Evaluating Conversational Recommender Systems via Large Language Models: A User-Centric Framework
- Authors: Nuo Chen, Quanyu Dai, Xiaoyu Dong, Piaohong Wang, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Xiao-Ming Wu
- Abstract summary: Conversational recommender systems (CRSs) integrate both recommendation and dialogue tasks. Existing approaches primarily assess CRS performance by separately evaluating item recommendation and dialogue management using rule-based metrics. We propose a user-centric evaluation framework based on large language models (LLMs) for CRSs, namely the Conversational Recommendation Evaluator (CoRE).
- Score: 35.20623751587154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational recommender systems (CRSs) integrate both recommendation and dialogue tasks, making their evaluation uniquely challenging. Existing approaches primarily assess CRS performance by separately evaluating item recommendation and dialogue management using rule-based metrics. However, these methods fail to capture the real human experience and cannot support direct conclusions about a system's overall performance. As conversational recommender systems become increasingly vital in e-commerce, social media, and customer support, the ability to evaluate both recommendation accuracy and dialogue management quality with a single metric that authentically reflects user experience has become the principal challenge impeding progress in this field. In this work, we propose a user-centric evaluation framework based on large language models (LLMs) for CRSs, namely the Conversational Recommendation Evaluator (CoRE). CoRE consists of two main components. (1) LLM-as-Evaluator: we comprehensively summarize 12 key factors influencing user experience in CRSs and directly leverage an LLM as an evaluator to assign a score to each factor. (2) Multi-Agent Debater: we design a multi-agent debate framework with four distinct roles (common user, domain expert, linguist, and HCI expert) that discuss and synthesize the 12 factor scores into a unified overall performance score. We then apply the proposed framework to evaluate four CRSs on two benchmark datasets. The experimental results show that CoRE aligns well with human evaluation on most of the 12 factors and on the overall assessment. In particular, CoRE's overall evaluation scores align significantly better with human feedback than existing rule-based metrics.
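As a rough illustration of the two-stage design the abstract describes, the sketch below wires per-factor LLM scoring into a role-played debate that produces one overall score. The factor names, prompt wording, and the call_llm() stub are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CoRE-style pipeline; prompts and factor names are assumptions.
from statistics import mean

FACTORS = ["relevance", "diversity", "novelty", "fluency"]  # the paper defines 12 factors
ROLES = ["common user", "domain expert", "linguist", "HCI expert"]

def call_llm(prompt: str) -> str:
    """Placeholder: route to any LLM backend and return its text reply."""
    raise NotImplementedError

def llm_as_evaluator(dialogue: str) -> dict[str, float]:
    """Stage 1: ask the LLM to score each user-experience factor (1-10)."""
    scores = {}
    for factor in FACTORS:
        reply = call_llm(
            f"Rate this conversational recommendation dialogue on '{factor}' "
            f"from 1 to 10. Reply with a number only.\n{dialogue}"
        )
        scores[factor] = float(reply.strip())
    return scores

def multi_agent_debate(dialogue: str, scores: dict[str, float], rounds: int = 2) -> float:
    """Stage 2: four role-played agents debate the factor scores, then each
    proposes an overall score; here we simply average the final proposals."""
    transcript = f"Factor scores: {scores}\nDialogue:\n{dialogue}"
    for _ in range(rounds):
        for role in ROLES:
            turn = call_llm(
                f"You are a {role} judging a conversational recommender. "
                f"Given the debate so far, argue how the factors should be "
                f"weighed into one overall score.\n{transcript}"
            )
            transcript += f"\n[{role}] {turn}"
    finals = [
        float(call_llm(
            f"As a {role}, given this debate, output your final overall "
            f"score from 1 to 10 as a number only.\n{transcript}"
        ).strip())
        for role in ROLES
    ]
    return mean(finals)
```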
Related papers
- Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges [22.7340872046127]
We propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment.
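A minimal sketch of the judge-pooling idea, assuming simple score averaging; the paper's actual contribution is distilling the judges' collective preference knowledge into one lightweight model, which this stub does not attempt.

```python
# Hedged sketch: pool several judges' scores for one (context, response) pair.
from statistics import mean
from typing import Callable

Judge = Callable[[str, str], float]  # (context, response) -> quality score

def aggregate_judges(judges: list[Judge], context: str, response: str) -> float:
    """Average the judges' scores; distillation would replace this at inference time."""
    return mean(j(context, response) for j in judges)
```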
arXiv Detail & Related papers (2025-08-01T09:26:01Z)
- FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems [4.028503203417233]
This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method. It provides evaluation scores for diverse turn- and dialogue-level qualities of recommendation conversations. FACE is reference-free and shows strong correlation with human judgments.
arXiv Detail & Related papers (2025-05-30T23:54:13Z)
- Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models [70.180385882195]
This paper introduces a personality-aware user simulation for Conversational Recommender Systems (CRSs). The user agent exhibits customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interactions in CRSs.
Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits.
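For illustration, a persona-conditioned simulator turn might look like the sketch below; the trait fields, prompt wording, and call_llm() stub are assumptions, not the paper's setup.

```python
# Illustrative persona-conditioned user simulator turn.
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Placeholder for any LLM backend."""
    raise NotImplementedError

@dataclass
class Persona:
    openness: str       # e.g. "high"
    agreeableness: str  # e.g. "low"
    preferences: str    # e.g. "indie films, dislikes horror"

def simulate_user_turn(persona: Persona, history: list[str]) -> str:
    prompt = (
        f"You are a user with {persona.openness} openness and "
        f"{persona.agreeableness} agreeableness who likes {persona.preferences}. "
        "Reply in character to the recommender's last message.\n"
        + "\n".join(history)
    )
    return call_llm(prompt)
```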
arXiv Detail & Related papers (2025-04-09T13:21:17Z)
- Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings.
We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
- Graph Retrieval-Augmented LLM for Conversational Recommendation Systems [52.35491420330534]
G-CRS (Graph Retrieval-Augmented Large Language Model for Conversational Recommender Systems) is a training-free framework that combines graph retrieval-augmented generation and in-context learning.
G-CRS achieves superior recommendation performance compared to existing methods without requiring task-specific training.
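The training-free retrieve-then-prompt pattern G-CRS describes could be sketched as follows; the toy adjacency-list graph, one-hop retrieval rule, and prompt format are assumptions for demonstration, not the paper's method.

```python
# Rough sketch of graph retrieval feeding in-context recommendation.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM backend

def retrieve_candidates(graph: dict[str, list[str]],
                        mentioned: list[str], k: int = 5) -> list[str]:
    """Collect items adjacent to entities mentioned in the dialogue."""
    seen, out = set(mentioned), []
    for entity in mentioned:
        for neighbor in graph.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                out.append(neighbor)
    return out[:k]

def recommend(graph: dict[str, list[str]], dialogue: str,
              mentioned: list[str]) -> str:
    candidates = retrieve_candidates(graph, mentioned)
    return call_llm(
        "Given the conversation and these retrieved candidate items "
        f"{candidates}, recommend the best item and explain why.\n{dialogue}"
    )
```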
arXiv Detail & Related papers (2025-03-09T03:56:22Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS.
We formulate RRSs from a causal perspective, modeling recommendations as bilateral interventions.
We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
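As a hedged sketch of bilateral matching, the snippet below reranks candidates by the harmonic mean of the two directions' preference probabilities; this is a common reciprocal heuristic, not necessarily one of the paper's five proposed metrics.

```python
# Toy reranking by a bilateral (reciprocal) matching score.
def reciprocal_score(p_a_likes_b: float, p_b_likes_a: float) -> float:
    """Harmonic mean rewards mutual interest over one-sided appeal."""
    if p_a_likes_b == 0 or p_b_likes_a == 0:
        return 0.0
    return 2 * p_a_likes_b * p_b_likes_a / (p_a_likes_b + p_b_likes_a)

def rerank(candidates: list[tuple[str, float, float]]) -> list[str]:
    """candidates: (item_id, P(user likes owner), P(owner likes user))."""
    return [c[0] for c in sorted(
        candidates, key=lambda c: reciprocal_score(c[1], c[2]), reverse=True)]
```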
arXiv Detail & Related papers (2024-08-19T07:21:02Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommender Systems [1.652907918484303]
Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRSs).
LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.
This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction.
arXiv Detail & Related papers (2024-04-17T21:56:27Z)
- Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors [68.68418801681965]
We propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors.
Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons of current CRS models.
Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS.
arXiv Detail & Related papers (2024-04-04T08:56:48Z)
- A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
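One simple way to probe evaluator robustness, in the spirit of this analysis, is to compare an evaluator's scores before and after perturbing a response; the word-shuffle edit below is a crude stand-in for the paper's broader adversarial suite.

```python
# Minimal robustness probe for an LLM-based dialogue evaluator.
import random
from typing import Callable

def perturb(text: str) -> str:
    """Shuffle word order as a crude adversarial edit."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def robustness_gap(score_fn: Callable[[str, str], float],
                   context: str, response: str) -> float:
    """A robust evaluator should score the degraded response lower,
    so a larger positive gap suggests better robustness."""
    return score_fn(context, response) - score_fn(context, perturb(response))
```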
arXiv Detail & Related papers (2023-12-24T04:50:57Z)
- Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation [13.651502777079237]
This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups.
Our analysis reveals that Likert evaluations are robust to the choice of evaluator group, showing only minor differences across groups, whereas Pairwise evaluations are not.
arXiv Detail & Related papers (2023-09-14T19:19:50Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators.
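A schematic of an iEvaLM-style interactive loop is sketched below: an LLM user simulator converses with the CRS until a target item is recommended or the turn budget runs out. The two stubs and the substring-match success test are assumptions for illustration.

```python
# Schematic interactive evaluation loop with an LLM user simulator.
def user_simulator(history: list[str], target_item: str) -> str:
    raise NotImplementedError  # LLM prompted with the user's target preference

def crs_respond(history: list[str]) -> str:
    raise NotImplementedError  # the conversational recommender under test

def interactive_eval(target_item: str, max_turns: int = 10) -> bool:
    history: list[str] = []
    for _ in range(max_turns):
        history.append("USER: " + user_simulator(history, target_item))
        reply = crs_respond(history)
        history.append("SYSTEM: " + reply)
        if target_item.lower() in reply.lower():
            return True  # success: target item was recommended
    return False
```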
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
- INFACT: An Online Human Evaluation Framework for Conversational Recommendation [5.837881923712394]
Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations.
Current research on machine learning-based CRS models acknowledges the importance of humans in the evaluation process.
arXiv Detail & Related papers (2022-09-07T15:16:59Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)