A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models
- URL: http://arxiv.org/abs/2508.17571v1
- Date: Mon, 25 Aug 2025 00:45:16 GMT
- Title: A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models
- Authors: Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, Koki Karube
- Abstract summary: Serendipity in recommender systems (RSs) has attracted increasing attention as a concept that enhances user satisfaction by presenting unexpected and useful items. The existing offline metrics often depend on ambiguous definitions or are tailored to specific datasets and RSs, thereby limiting their generalizability. We propose a universally applicable evaluation framework that leverages large language models (LLMs), known for their extensive knowledge and reasoning capabilities, as evaluators.
- Score: 0.6524460254566904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Serendipity in recommender systems (RSs) has attracted increasing attention as a concept that enhances user satisfaction by presenting unexpected and useful items. However, evaluating serendipitous performance remains challenging because its ground truth is generally unobservable. The existing offline metrics often depend on ambiguous definitions or are tailored to specific datasets and RSs, thereby limiting their generalizability. To address this issue, we propose a universally applicable evaluation framework that leverages large language models (LLMs), known for their extensive knowledge and reasoning capabilities, as evaluators. First, to improve the evaluation performance of the proposed framework, we assessed the serendipity prediction accuracy of LLMs using four different prompt strategies on a dataset containing user-annotated serendipitous ground truth, and found that the chain-of-thought prompt achieved the highest accuracy. Next, we re-evaluated the serendipitous performance of both serendipity-oriented and general RSs using the proposed framework on three commonly used real-world datasets, without the ground truth. The results indicated that no serendipity-oriented RS consistently outperformed across all datasets, and a general RS sometimes achieved higher performance than the serendipity-oriented RSs.
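The abstract's evaluation protocol (prompt an LLM with chain-of-thought reasoning about a recommendation, then read off a serendipity judgment) can be sketched as below. The prompt wording, the YES/NO verdict format, and the helper names are illustrative assumptions, not the paper's exact protocol:

```python
# Sketch of LLM-as-evaluator serendipity judgment with a chain-of-thought
# prompt. The prompt template and verdict format are assumptions for
# illustration; the paper's actual prompts may differ.

def build_cot_prompt(user_history, candidate_item):
    """Assemble a chain-of-thought prompt asking whether a recommended
    item would be serendipitous (unexpected AND useful) for this user."""
    history = "\n".join(f"- {item}" for item in user_history)
    return (
        "A user has interacted with the following items:\n"
        f"{history}\n\n"
        f"Recommended item: {candidate_item}\n\n"
        "Think step by step: (1) How similar is the item to the user's "
        "history? (2) Would it be unexpected to the user? (3) Would it "
        "still be useful? Then finish with a final line 'Verdict: YES' "
        "if the recommendation is serendipitous, otherwise 'Verdict: NO'."
    )

def parse_verdict(llm_response: str) -> bool:
    """Extract the final YES/NO verdict from the model's free-text
    reasoning, scanning from the last line upward."""
    for line in reversed(llm_response.strip().splitlines()):
        if line.strip().upper().startswith("VERDICT:"):
            return "YES" in line.upper()
    return False  # default to non-serendipitous if no verdict is found

# The prompt would be sent to any chat-capable LLM; averaging the parsed
# verdicts over a recommendation list yields an offline serendipity score.
```

The point of the verdict line is to make the LLM's free-form reasoning machine-parseable, so the same framework can score any RS's output list without dataset-specific ground truth.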
Related papers
- Enhancing Sequential Recommendation with World Knowledge from Large Language Models [35.436916487752285]
GRASP is a flexible framework that integrates generation-augmented retrieval for synthesis and similarity retrieval. The retrieved similar users/items serve as auxiliary contextual information for the later holistic attention enhancement module. GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones.
arXiv Detail & Related papers (2025-11-25T10:59:38Z) - Integrated Framework for LLM Evaluation with Answer Generation [0.0]
We propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED). SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets.
arXiv Detail & Related papers (2025-09-24T13:20:37Z) - Personalized Recommendations via Active Utility-based Pairwise Sampling [1.704905100460915]
We propose a utility-based framework that learns preferences from simple and intuitive pairwise comparisons. A central contribution of our work is a novel utility-based active sampling strategy for preference elicitation.
arXiv Detail & Related papers (2025-08-12T19:09:33Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG [51.120170062795566]
We propose Divide-Then-Align (DTA) to endow RAG systems with the ability to respond with "I don't know" when the query is outside the knowledge boundary. DTA balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
arXiv Detail & Related papers (2025-05-27T08:21:21Z) - Large Language Model Empowered Recommendation Meets All-domain Continual Pre-Training [60.38082979765664]
CPRec is an all-domain continual pre-training framework for recommendation. It holistically aligns LLMs with universal user behaviors through the continual pre-training paradigm. We conduct experiments on five real-world datasets from two distinct platforms.
arXiv Detail & Related papers (2025-04-11T20:01:25Z) - Bursting Filter Bubble: Enhancing Serendipity Recommendations with Aligned Large Language Models [42.13005951072714]
Large language models (LLMs) have shown potential in serendipity prediction due to their extensive world knowledge and reasoning capabilities. We propose SERAL, a framework comprising three stages: Cognition Profile Generation, SerenGPT Alignment, and Nearline Adaptation. Online experiments demonstrate that SERAL improves the exposure ratio (PVR), clicks, and transactions of serendipitous items by 5.7%, 29.56%, and 27.6%, respectively, enhancing user experience without much impact on overall revenue.
arXiv Detail & Related papers (2025-02-19T08:47:42Z) - The Role of Fake Users in Sequential Recommender Systems [0.0]
We assess how the presence of fake users, who engage in random interactions, follow popular or unpopular items, or focus on a single genre, impacts the performance of sequential recommender systems (SRSs).
While traditional metrics like NDCG remain relatively stable, our findings reveal that the presence of fake users severely degrades RLS metrics, often reducing them to near-zero values.
arXiv Detail & Related papers (2024-10-13T17:44:04Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded as embedding vectors, resulting in excessively large embedding tables. Despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in their evaluation protocols. This study investigates the performance, efficiency, and cross-task transferability of various LERSs via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.