Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommender Systems
- URL: http://arxiv.org/abs/2404.11773v2
- Date: Thu, 17 Oct 2024 18:59:18 GMT
- Title: Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommender Systems
- Authors: Dayu Yang, Fumian Chen, Hui Fang,
- Abstract summary: Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS)
LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.
This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction.
- Score: 1.652907918484303
- License:
- Abstract: Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS). However, the application of LLMs to CRS has exposed a notable discrepancy in behavior between LLM-based CRS and human recommenders: LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction. Despite its importance, existing studies in CRS lack a study about how to measure such behavior discrepancy. To fill this gap, we propose Behavior Alignment, a new evaluation metric to measure how well the recommendation strategies made by a LLM-based CRS are consistent with human recommenders'. Our experiment results show that the new metric is better aligned with human preferences and can better differentiate how systems perform than existing evaluation metrics. As Behavior Alignment requires explicit and costly human annotations on the recommendation strategies, we also propose a classification-based method to implicitly measure the Behavior Alignment based on the responses. The evaluation results confirm the robustness of the method.
Related papers
- RLRF4Rec: Reinforcement Learning from Recsys Feedback for Enhanced Recommendation Reranking [33.54698201942643]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains.
This paper introduces RLRF4Rec, a novel framework integrating Reinforcement Learning from Recsys Feedback for Enhanced Recommendation Reranking.
arXiv Detail & Related papers (2024-10-08T11:42:37Z) - Large Language Model Empowered Embedding Generator for Sequential Recommendation [57.49045064294086]
Large Language Model (LLM) has the potential to understand the semantic connections between items, regardless of their popularity.
We present LLMEmb, an innovative technique that harnesses LLM to create item embeddings that bolster the performance of Sequential Recommender Systems.
arXiv Detail & Related papers (2024-09-30T03:59:06Z) - Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates [10.091146498861333]
Commercial large language models (LLMs) like GPT-4 have been recently employed to evaluate and compare different alignment approaches.
We develop a framework to evaluate, compare, and visualize the reliability and alignment of LLM judges.
arXiv Detail & Related papers (2024-08-23T11:49:01Z) - Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS.
We formulate the RRS from a causal perspective, formulating recommendations as bilateral interventions.
We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
arXiv Detail & Related papers (2024-08-19T07:21:02Z) - Large Language Models as Evaluators for Recommendation Explanations [23.938202791437337]
We investigate whether LLMs can serve as evaluators of recommendation explanations.
We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users.
Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z) - Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
arXiv Detail & Related papers (2024-03-25T17:11:28Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Rethinking the Evaluation for Conversational Recommendation in the Era
of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs)
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z) - Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.