Limitations of Current Evaluation Practices for Conversational Recommender Systems and the Potential of User Simulation
- URL: http://arxiv.org/abs/2510.05624v1
- Date: Tue, 07 Oct 2025 07:12:47 GMT
- Title: Limitations of Current Evaluation Practices for Conversational Recommender Systems and the Potential of User Simulation
- Authors: Nolwenn Bernard, Krisztian Balog
- Abstract summary: This paper critically examines current evaluation practices for conversational recommender systems (CRSs). We identify two key limitations: the over-reliance on static test collections and the inadequacy of existing evaluation metrics. We propose novel evaluation metrics, based on a general reward/cost framework, designed to better align with real user satisfaction.
- Score: 19.14733504795247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research and development on conversational recommender systems (CRSs) critically depends on sound and reliable evaluation methodologies. However, the interactive nature of these systems poses significant challenges for automatic evaluation. This paper critically examines current evaluation practices and identifies two key limitations: the over-reliance on static test collections and the inadequacy of existing evaluation metrics. To substantiate this critique, we analyze real user interactions with nine existing CRSs and demonstrate a striking disconnect between self-reported user satisfaction and performance scores reported in prior literature. To address these limitations, this work explores the potential of user simulation to generate dynamic interaction data, offering a departure from static datasets. Furthermore, we propose novel evaluation metrics, based on a general reward/cost framework, designed to better align with real user satisfaction. Our analysis of different simulation approaches provides valuable insights into their effectiveness and reveals promising initial results, showing improved correlation with system rankings derived from human evaluation. While these findings indicate a significant step forward in CRS evaluation, we also identify areas for future research and refinement in both simulation techniques and evaluation metrics.
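The abstract does not spell out the reward/cost framework, but its core intuition can be sketched: a dialogue earns reward for successful recommendations and pays a cost for every turn of user effort. The Python sketch below is a minimal, hypothetical illustration of such a metric; the Turn structure and the reward_per_success / cost_per_turn weights are assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One dialogue turn; fields are hypothetical, not from the paper."""
    speaker: str                # "user" or "system"
    accepted_rec: bool = False  # True if this system turn yielded an accepted recommendation

def reward_cost_score(dialogue: list[Turn],
                      reward_per_success: float = 1.0,
                      cost_per_turn: float = 0.1) -> float:
    """Net utility of a conversation: rewards earned minus interaction cost."""
    reward = reward_per_success * sum(t.accepted_rec for t in dialogue
                                      if t.speaker == "system")
    cost = cost_per_turn * len(dialogue)  # every turn costs user effort
    return reward - cost

# Example: a 6-turn dialogue ending in one accepted recommendation
dialogue = [Turn("user"), Turn("system"), Turn("user"),
            Turn("system"), Turn("user"), Turn("system", accepted_rec=True)]
print(reward_cost_score(dialogue))  # one success minus six turns of cost ≈ 0.4
```

Under this view, a system that eventually succeeds after many turns can score below one that succeeds quickly, which is how a reward/cost metric can track user satisfaction more closely than recall-style metrics computed on static test collections.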
Related papers
- On the Reliability of User-Centric Evaluation of Conversational Recommender Systems [0.9112926574395824]
We present a large-scale empirical study investigating the reliability of user-centric CRS evaluation on static dialogue transcripts. We collect 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Our results show that utilitarian and outcome-oriented dimensions such as accuracy, usefulness, and satisfaction achieve moderate reliability under aggregation. Many dimensions collapse into a single global quality signal, revealing a strong halo effect in third-party judgments.
arXiv Detail & Related papers (2026-02-19T11:10:11Z)
- CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization [53.79487826635141]
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers, but it struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization.
arXiv Detail & Related papers (2025-08-12T16:49:18Z)
- Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems [6.540293515339111]
Offline evaluation of retrieval-ranking systems is crucial for developing high-performing models. We propose a novel framework for robust offline evaluation of retrieval-ranking systems. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness.
arXiv Detail & Related papers (2025-04-04T23:52:57Z)
- The simulation of judgment in LLMs [32.57692724251287]
Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings (NewsGuard and Media Bias/Fact Check) and against human judgments collected through a controlled experiment.
arXiv Detail & Related papers (2025-02-06T18:52:10Z)
- Stop Playing the Guessing Game! Target-free User Simulation for Evaluating Conversational Recommender Systems [21.275452863162936]
PEPPER is an evaluation protocol with target-free user simulators constructed from real-user interaction histories and reviews. PEPPER enables realistic user-CRS dialogues without falling into simplistic guessing games. PEPPER presents detailed measures for comprehensively evaluating the preference elicitation capabilities of CRSs.
arXiv Detail & Related papers (2024-11-25T07:36:20Z)
- Pessimistic Evaluation [58.736490198613154]
We argue that the evaluation of information access systems assumes utilitarian values that are not aligned with traditions of information access based on equal access.
We advocate for pessimistic evaluation of information access systems focusing on worst case utility.
arXiv Detail & Related papers (2024-10-17T15:40:09Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we investigate the use of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive Evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators (a minimal sketch of such a simulator loop appears after this list).
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
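Several of the papers above (iEvaLM, PEPPER) and the main paper itself evaluate CRSs by replacing static test collections with a simulated user that interacts with the system. Below is a minimal sketch of such a simulator-driven evaluation loop; simulate_user_reply and crs_respond are hypothetical stand-ins, not the actual components of any of these systems.

```python
def simulate_user_reply(history: list[str], preferences: dict) -> str:
    """Stand-in for an LLM- or history-based user simulator."""
    wanted = preferences["target_genre"]
    if wanted.lower() in history[-1].lower():
        return "ACCEPT"  # the last recommendation matches the simulated preference
    return f"Something more {wanted}, please."

def crs_respond(history: list[str]) -> str:
    """Stand-in for the conversational recommender under evaluation."""
    return "How about a classic comedy film?"

def run_episode(preferences: dict, max_turns: int = 10) -> dict:
    """Let simulator and CRS converse; record success and dialogue length."""
    history = ["Hi! I'm looking for a movie recommendation."]
    for turn in range(max_turns):
        history.append(crs_respond(history))
        reply = simulate_user_reply(history, preferences)
        if reply == "ACCEPT":
            return {"success": True, "turns": turn + 1}
        history.append(reply)
    return {"success": False, "turns": max_turns}

print(run_episode({"target_genre": "comedy"}))  # {'success': True, 'turns': 1}
```

In practice the simulator would be backed by an LLM or by real interaction histories and reviews (as in PEPPER), and per-episode outcomes would feed metrics such as the reward/cost score sketched earlier.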
This list is automatically generated from the titles and abstracts of the papers on this site.