Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
- URL: http://arxiv.org/abs/2508.08777v1
- Date: Tue, 12 Aug 2025 09:23:35 GMT
- Title: Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
- Authors: Francesco Fabbri, Gustavo Penha, Edoardo D'Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stal, Mounia Lalmas
- Abstract summary: We propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories.
- Score: 8.554894195710204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context, enabling the LLM to reason more effectively about alignment between a user's interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
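The two-stage pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, function names, and the generic `llm` callable are all assumptions.

```python
def build_profile_prompt(history):
    """Stage 1: ask an LLM to distill a listening history (e.g. 90 days
    of episodes) into a compact natural-language profile covering both
    topical interests and behavioral patterns."""
    lines = "\n".join(f"- {ep}" for ep in history)
    return (
        "Summarize this user's podcast interests and listening habits "
        "as a short natural-language profile:\n" + lines
    )

def pointwise_prompt(profile, episode):
    """Stage 2a: fine-grained pointwise judgment of profile-episode match."""
    return (
        f"User profile:\n{profile}\n\n"
        f"Candidate episode:\n{episode}\n\n"
        "Rate how well this episode matches the profile on a 1-5 scale."
    )

def pairwise_prompt(profile, episode_a, episode_b):
    """Stage 2b: pairwise judgment between two recommended episodes."""
    return (
        f"User profile:\n{profile}\n\n"
        f"Episode A:\n{episode_a}\n\nEpisode B:\n{episode_b}\n\n"
        "Which episode better matches the profile? Answer A or B."
    )

def judge_recommendations(llm, history, candidates):
    """Run both stages given any callable llm(prompt) -> str.
    The profile, not the raw history, is what the judge sees."""
    profile = llm(build_profile_prompt(history))
    scores = {ep: llm(pointwise_prompt(profile, ep)) for ep in candidates}
    return profile, scores
```

The key design point the paper argues for is visible in `judge_recommendations`: the raw history is consumed only once, in Stage 1, and the judge in Stage 2 reasons over the distilled profile, which reduces input size and makes the judgments easier to interpret.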
Related papers
- Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation? [40.49875426230813]
This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection, we first examine the limitations of existing evaluation methodologies. We find that incorporating richer item metadata and longer user histories improves alignment, and that the LLM-judge yields high agreement with human-based rankings.
arXiv Detail & Related papers (2025-11-28T16:10:39Z)
- Biases in LLM-Generated Musical Taste Profiles for Recommendation [6.482557558168364]
Large Language Models (LLMs) for recommendation can generate Natural Language (NL) user taste profiles from consumption data. But it remains unclear whether users consider these profiles to be an accurate representation of their taste. We study this issue in the context of music streaming, where personalization is challenged by a large and culturally diverse catalog.
arXiv Detail & Related papers (2025-07-22T15:44:10Z)
- Towards Explainable Temporal User Profiling with LLMs [3.719862246745416]
We leverage large language models (LLMs) to generate natural language summaries of users' interaction histories. Our framework not only models temporal user preferences but also produces natural language profiles that can be used to explain recommendations in an interpretable manner.
arXiv Detail & Related papers (2025-05-01T22:02:46Z)
- Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale [53.059480071818136]
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories. We evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile.
arXiv Detail & Related papers (2025-04-19T08:16:10Z)
- Mind the Gap! Static and Interactive Evaluations of Large Audio Models [55.87220295533817]
Large Audio Models (LAMs) are designed to power voice-native experiences. This study introduces an interactive approach to evaluate LAMs and collect 7,500 LAM interactions from 484 participants.
arXiv Detail & Related papers (2025-02-21T20:29:02Z)
- RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models [40.74293642666989]
We present the idea of RecSys Arena, where the recommendation results given by two different recommender systems are evaluated by an LLM judger to obtain fine-grained evaluation feedback. We demonstrate that many different LLMs provide general evaluation results that are highly consistent with canonical offline metrics. It can better distinguish different algorithms with comparable performance in terms of AUC and nDCG.
arXiv Detail & Related papers (2024-12-15T05:57:36Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences. Our key idea is leveraging the human prior knowledge within the small (seed) data. We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z)
- Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- Recommendations by Concise User Profiles from Review Text [24.408292545170944]
This work addresses the difficult and underexplored case of users who have very sparse interactions but post informative review texts. Feeding the full text of all reviews through an LLM has a weak signal-to-noise ratio and incurs high costs of processed tokens. It presents a lightweight framework, called CUP, which first computes concise user profiles and feeds only these into the training of transformer-based recommenders.
arXiv Detail & Related papers (2023-11-02T15:31:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.