What Are We Optimizing For? A Human-centric Evaluation of Deep Learning-based Movie Recommenders
- URL: http://arxiv.org/abs/2401.11632v2
- Date: Wed, 1 May 2024 17:55:28 GMT
- Title: What Are We Optimizing For? A Human-centric Evaluation of Deep Learning-based Movie Recommenders
- Authors: Ruixuan Sun, Xinyi Wu, Avinash Akella, Ruoyan Kong, Bart Knijnenburg, Joseph A. Konstan
- Abstract summary: We conduct a human-centric evaluation case study of four leading DL-RecSys models in the movie domain.
We test how different DL-RecSys models perform in personalized recommendation generation by conducting a survey study with 445 real, active users.
We find some DL-RecSys models to be superior at recommending novel and unexpected items but weaker in diversity, trustworthiness, transparency, accuracy, and overall user satisfaction compared to classic collaborative filtering (CF) methods.
- Score: 12.132920692489911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the past decade, deep learning (DL) models have gained prominence for their exceptional accuracy on benchmark datasets in recommender systems (RecSys). However, their evaluation has primarily relied on offline metrics, overlooking direct user perception and experience. To address this gap, we conduct a human-centric evaluation case study of four leading DL-RecSys models in the movie domain. We test how different DL-RecSys models perform in personalized recommendation generation by conducting a survey study with 445 real, active users. We find some DL-RecSys models to be superior at recommending novel and unexpected items but weaker in diversity, trustworthiness, transparency, accuracy, and overall user satisfaction compared to classic collaborative filtering (CF) methods. To further explain the reasons behind this underperformance, we apply a comprehensive path analysis. We discover that the lack of diversity and the excess of serendipity in DL models can negatively impact the consequent perceived transparency and personalization of recommendations. Such a path ultimately leads to lower summative user satisfaction. Qualitatively, we confirm with real user quotes that accuracy plus at least one other attribute is necessary to ensure a good user experience, while users' demands for transparency and trust cannot be neglected. Based on our findings, we discuss future human-centric DL-RecSys design and optimization strategies.
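As a rough illustration of the path analysis described above, a recursive path model can be estimated as a chain of regressions over the survey constructs. The sketch below is not the authors' code; the construct names (diversity, serendipity, transparency, personalization, accuracy, satisfaction) and the specific paths are assumptions inferred from the abstract.

```python
# Minimal sketch (not the authors' code) of a recursive path analysis over
# survey constructs, estimated as a chain of OLS regressions with statsmodels.
# The column names and the specific paths are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf


def fit_path_model(df: pd.DataFrame) -> dict:
    """Fit each structural equation of a hypothesized path:
    diversity/serendipity -> transparency, personalization -> satisfaction."""
    equations = {
        "transparency": "transparency ~ diversity + serendipity",
        "personalization": "personalization ~ diversity + serendipity",
        "satisfaction": "satisfaction ~ transparency + personalization + accuracy",
    }
    # Each equation is one OLS regression; direct effects are its coefficients,
    # and indirect effects are products of coefficients along a path.
    return {target: smf.ols(formula, data=df).fit()
            for target, formula in equations.items()}


# Usage, assuming survey_df holds one row per respondent with the columns above:
# results = fit_path_model(survey_df)
# print(results["satisfaction"].params)
```

A dedicated structural equation modeling package would additionally report model fit indices, but chained regressions already recover the direct effects (coefficients) and indirect effects (products of coefficients along a path) of a recursive path model.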
Related papers
- CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence [55.21518669075263]
CURE4Rec is the first comprehensive benchmark for recommendation unlearning evaluation.
We consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels.
arXiv Detail & Related papers (2024-08-26T16:21:50Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z) - Large Language Models as Conversational Movie Recommenders: A User Study [3.3636849604467]
Large language models (LLMs) offer strong recommendation explainability but lack overall personalization, diversity, and user trust.
LLMs show a greater ability to recommend lesser-known or niche movies.
arXiv Detail & Related papers (2024-04-29T20:17:06Z) - Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z) - Uncertainty-Aware Explainable Recommendation with Large Language Models [15.229417987212631]
We develop a model that utilizes the ID vectors of user and item inputs as prompts for GPT-2.
We employ a joint training mechanism within a multi-task learning framework to optimize both the recommendation task and the explanation task (a minimal joint-loss sketch appears after this list).
Our method achieves 1.59 DIV, 0.57 USR, and 0.41 FCR on the Yelp, TripAdvisor, and Amazon datasets, respectively.
arXiv Detail & Related papers (2024-01-31T14:06:26Z) - Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation [52.62492168507781]
We propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM).
This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes.
By utilizing our FaiRLLM benchmark, we conducted an evaluation of ChatGPT and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations.
arXiv Detail & Related papers (2023-05-12T16:54:36Z) - Personalizing Intervened Network for Long-tailed Sequential User Behavior Modeling [66.02953670238647]
Tail users receive significantly lower-quality recommendations than head users after joint training.
A model trained separately on tail users still achieves inferior results due to limited data.
We propose a novel approach that significantly improves the recommendation performance of the tail users.
arXiv Detail & Related papers (2022-08-19T02:50:19Z) - CausPref: Causal Preference Learning for Out-of-Distribution Recommendation [36.22965012642248]
Current recommender systems are still vulnerable to distribution shifts of users and items in realistic scenarios.
We propose to incorporate the recommendation-specific DAG learner into a novel causal preference-based recommendation framework named CausPref.
Our approach significantly surpasses the benchmark models under various types of out-of-distribution settings.
arXiv Detail & Related papers (2022-02-08T16:42:03Z)
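As a rough illustration of the joint multi-task training mentioned in the Uncertainty-Aware Explainable Recommendation entry above, the recommendation and explanation objectives can be combined as a weighted sum of losses. The sketch below is a generic PyTorch example under assumed shapes and loss weights, not that paper's implementation; a small GRU decoder stands in for GPT-2 to keep it self-contained.

```python
# Generic PyTorch sketch of joint multi-task training for recommendation plus
# explanation generation. Shapes, weights, and the tiny decoder are assumptions;
# the cited paper prompts GPT-2 with user/item ID vectors, not reproduced here.
import torch
import torch.nn as nn


class JointRecExplainer(nn.Module):
    def __init__(self, n_users: int, n_items: int, vocab_size: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.rating_head = nn.Linear(2 * dim, 1)        # recommendation task
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.decoder = nn.GRU(dim, 2 * dim, batch_first=True)
        self.lm_head = nn.Linear(2 * dim, vocab_size)    # explanation task

    def forward(self, users, items, expl_tokens):
        u, v = self.user_emb(users), self.item_emb(items)
        context = torch.cat([u, v], dim=-1)               # (B, 2*dim)
        rating = self.rating_head(context).squeeze(-1)    # predicted rating
        # Condition the explanation decoder on the user-item context.
        h0 = context.unsqueeze(0)                          # (1, B, 2*dim)
        out, _ = self.decoder(self.token_emb(expl_tokens), h0)
        logits = self.lm_head(out)                         # (B, T, vocab)
        return rating, logits


def joint_loss(rating_pred, rating_true, logits, tokens, w_rec=1.0, w_exp=1.0):
    """Weighted sum of the two task losses: the multi-task objective."""
    rec_loss = nn.functional.mse_loss(rating_pred, rating_true)
    exp_loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict the next token
        tokens[:, 1:].reshape(-1))
    return w_rec * rec_loss + w_exp * exp_loss
```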
This list is automatically generated from the titles and abstracts of the papers in this site.