Evaluating ChatGPT as a Recommender System: A Rigorous Approach
- URL: http://arxiv.org/abs/2309.03613v2
- Date: Tue, 4 Jun 2024 14:25:45 GMT
- Title: Evaluating ChatGPT as a Recommender System: A Rigorous Approach
- Authors: Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia, Eugenio Di Sciascio
- Abstract summary: We propose a robust evaluation pipeline that assesses ChatGPT's ability as an RS and post-processes its recommendations.
We analyze the model's functionality in three settings: Top-N recommendation, cold-start recommendation, and re-ranking of a list of recommendations.
- Score: 12.458752059072706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have recently shown impressive abilities in handling various natural language-related tasks. Among different LLMs, current studies have assessed ChatGPT's superior performance across manifold tasks, especially under zero/few-shot prompting conditions. Given such successes, the Recommender Systems (RSs) research community has started investigating its potential applications within the recommendation scenario. However, although various methods have been proposed to integrate ChatGPT's capabilities into RSs, current research struggles to comprehensively evaluate such models while considering the peculiarities of generative models. Often, evaluations do not account for hallucinations, duplications, and recommendations that fall outside the closed domain, focusing solely on accuracy metrics and neglecting the impact on beyond-accuracy facets. To bridge this gap, we propose a robust evaluation pipeline that assesses ChatGPT's ability as an RS and post-processes its recommendations to account for these aspects. Through this pipeline, we investigate the performance of ChatGPT-3.5 and ChatGPT-4 on the recommendation task under the zero-shot condition, employing a role-playing prompt. We analyze the model's functionality in three settings: Top-N recommendation, cold-start recommendation, and re-ranking of a list of recommendations, and in three domains: movies, music, and books. The experiments reveal that ChatGPT exhibits higher accuracy than the baselines in the books domain. It also excels in re-ranking and cold-start scenarios while maintaining reasonable beyond-accuracy metrics. Furthermore, we measure the similarity between ChatGPT's recommendations and those of the other recommenders, providing insights about how ChatGPT could be categorized in the realm of recommender systems. The evaluation pipeline is publicly released for future research.
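The post-processing step described above (accounting for hallucinations, duplications, and out-of-domain items before computing accuracy and beyond-accuracy metrics) can be illustrated with a minimal sketch. It assumes the model's recommendations arrive as free-text titles and the closed domain is a set of known catalog titles; the function name, the fuzzy matching via difflib, and the similarity threshold are illustrative assumptions, not the API of the released pipeline.

```python
# Minimal sketch of post-processing LLM-generated recommendations:
# map free-text titles onto a closed catalog, dropping hallucinated,
# out-of-domain, and duplicated items, then truncate to the top N.
# All names and thresholds here are illustrative assumptions.
from difflib import get_close_matches

def postprocess_recommendations(raw_titles, catalog, top_n=10, cutoff=0.9):
    """Clean a ranked list of LLM-generated titles against a closed catalog.

    raw_titles: list[str] -- titles as generated by the model, in rank order
    catalog:    set[str]  -- titles of items that exist in the target domain
    top_n:      int       -- length of the final recommendation list
    cutoff:     float     -- similarity threshold for fuzzy title matching
    """
    catalog_list = list(catalog)
    cleaned, seen = [], set()
    for title in raw_titles:
        # Titles that cannot be matched to the catalog are treated as
        # hallucinated or out-of-domain and are discarded.
        match = get_close_matches(title.strip(), catalog_list, n=1, cutoff=cutoff)
        if not match:
            continue
        item = match[0]
        if item in seen:           # drop duplicated recommendations
            continue
        seen.add(item)
        cleaned.append(item)
        if len(cleaned) == top_n:  # keep only the Top-N items
            break
    return cleaned

# Example: one duplicate and one hallucinated title are removed.
catalog = {"The Matrix", "Inception", "Blade Runner", "Alien"}
raw = ["The Matrix", "Inception", "The Matrix", "A Film That Does Not Exist"]
print(postprocess_recommendations(raw, catalog, top_n=3))
# -> ['The Matrix', 'Inception']
```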
Related papers
- Using ChatGPT to Score Essays and Short-Form Constructed Responses [0.0]
The investigation focused on various prediction models, including linear regression, random forest, gradient boosting, and boosting.
ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics.
Study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments.
arXiv Detail & Related papers (2024-08-18T16:51:28Z)
- Aligning GPTRec with Beyond-Accuracy Goals with Reinforcement Learning [67.71952251641545]
GPTRec is an alternative to the Top-K model for item-by-item recommendations.
Our experiments on two datasets show that GPTRec's Next-K generation approach offers a better tradeoff between accuracy and secondary metrics than classic greedy re-ranking techniques.
arXiv Detail & Related papers (2024-03-07T19:47:48Z)
- ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback [1.3654846342364308]
Large Language Models (LLMs) like ChatGPT have gained popularity due to their ease of use and their ability to adapt dynamically to various tasks while responding to feedback.
We build a rigorous pipeline around ChatGPT to simulate how a user might realistically probe the model for recommendations.
We explore the effect of popularity bias in ChatGPT's recommendations, and compare its performance to baseline models.
arXiv Detail & Related papers (2024-01-07T23:17:42Z)
- Towards LLM-driven Dialogue State Tracking [13.679946384741008]
Large language models (LLMs) such as GPT-3 and ChatGPT have sparked considerable interest in assessing their efficacy across diverse applications.
We present LDST, an LLM-driven Dialogue State Tracking framework based on smaller, open-source foundation models.
We find that LDST exhibits remarkable performance improvements in both zero-shot and few-shot settings compared to previous SOTA methods.
arXiv Detail & Related papers (2023-10-23T14:15:28Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation [52.62492168507781]
We propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM).
This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes.
By utilizing our FaiRLLM benchmark, we conducted an evaluation of ChatGPT and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations.
arXiv Detail & Related papers (2023-05-12T16:54:36Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences [6.821378903525802]
ChatGPT has consistently demonstrated a remarkable level of accuracy and reliability in terms of content evaluation.
A test set consisting of prompts is created, covering a wide range of use cases, and five models are utilized to generate corresponding responses.
Results on the test set show that ChatGPT's ranking preferences are consistent with human preferences to a certain extent.
arXiv Detail & Related papers (2023-03-14T03:13:02Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the popular GLUE benchmark and compare it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)