Vague Preference Policy Learning for Conversational Recommendation
- URL: http://arxiv.org/abs/2306.04487v6
- Date: Tue, 27 May 2025 07:38:18 GMT
- Title: Vague Preference Policy Learning for Conversational Recommendation
- Authors: Gangyi Zhang, Chongming Gao, Wenqiang Lei, Xiaojie Guo, Shijun Li, Hongshen Chen, Zhuozhi Ding, Sulong Xu, Lingfei Wu
- Abstract summary: Conversational recommendation systems commonly assume users have clear preferences, leading to potential over-filtering. We introduce the Vague Preference Multi-round Conversational Recommendation (VPMCR) scenario, employing a soft estimation mechanism to accommodate users' vague and dynamic preferences. Our work advances CRS by accommodating users' inherent ambiguity and relative decision-making processes, improving real-world applicability.
- Score: 48.868921530958666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational recommendation systems (CRS) commonly assume users have clear preferences, leading to potential over-filtering of relevant alternatives. However, users often exhibit vague, non-binary preferences. We introduce the Vague Preference Multi-round Conversational Recommendation (VPMCR) scenario, employing a soft estimation mechanism to accommodate users' vague and dynamic preferences while mitigating over-filtering. In VPMCR, we propose Vague Preference Policy Learning (VPPL), consisting of Ambiguity-aware Soft Estimation (ASE) and Dynamism-aware Policy Learning (DPL). ASE captures preference vagueness by estimating scores for clicked and non-clicked options, using a choice-based approach and time-aware preference decay. DPL leverages ASE's preference distribution to guide the conversation and adapt to preference changes for recommendations or attribute queries. Extensive experiments demonstrate VPPL's effectiveness within VPMCR, outperforming existing methods and setting a new benchmark. Our work advances CRS by accommodating users' inherent ambiguity and relative decision-making processes, improving real-world applicability.
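The abstract describes ASE only at a high level: scores for clicked and non-clicked options, combined with a time-aware decay. The following is a minimal sketch of that general idea, not the authors' implementation. The function names, the `decay` factor, the fixed `clicked_score` and `non_clicked_score` values, and the softmax normalization are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def soft_preference(turns, decay=0.8, clicked_score=1.0, non_clicked_score=-0.5):
    """Accumulate soft scores per option across conversation turns.

    turns: list of (clicked, non_clicked) pairs of option sets, oldest first.
    Older turns are down-weighted by decay ** age (hypothetical decay rule).
    Non-clicked options receive a mild penalty instead of being filtered out.
    """
    scores = defaultdict(float)
    last = len(turns) - 1
    for t, (clicked, non_clicked) in enumerate(turns):
        weight = decay ** (last - t)  # the most recent turn has weight 1.0
        for option in clicked:
            scores[option] += weight * clicked_score
        for option in non_clicked:
            scores[option] += weight * non_clicked_score
    return dict(scores)


def to_distribution(scores):
    """Turn raw scores into a probability distribution with a softmax."""
    top = max(scores.values())
    exps = {option: math.exp(s - top) for option, s in scores.items()}
    total = sum(exps.values())
    return {option: e / total for option, e in exps.items()}


# Two turns: the user first clicks "jazz" over "rock", then "rock" over "pop".
turns = [({"jazz"}, {"rock"}), ({"rock"}, {"pop"})]
scores = soft_preference(turns, decay=0.5)
dist = to_distribution(scores)
```

In this sketch "rock" is penalized in the first turn but still ends up ranked highest after the later click, which illustrates how a soft estimate avoids permanently filtering out an option the user initially passed over.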
Related papers
- How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics [65.67654005892469]
We show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies. Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods.
arXiv Detail & Related papers (2026-02-12T17:11:08Z) - RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems [109.9061591263748]
RecNet is a self-evolving preference propagation framework for recommender systems. It proactively propagates real-time preference updates across related users and items. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework.
arXiv Detail & Related papers (2026-01-29T12:14:31Z) - Tree of Preferences for Diversified Recommendation [54.183647833064136]
We study diversified recommendation from a data-bias perspective. Inspired by the outstanding performance of large language models (LLMs) in zero-shot inference leveraging world knowledge, we propose a novel approach.
arXiv Detail & Related papers (2025-12-24T04:13:17Z) - RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders [0.8246494848934447]
We propose a fine-tuning solution using reinforcement learning from human feedback (RLHF) to maximize implied user feedback (IUF) in a multi-turn recommendation context. We show that our RLHF-fine-tuned models can perform better in terms of top-$k$ recommendation accuracy, coherence, and user satisfaction.
arXiv Detail & Related papers (2025-08-07T11:36:55Z) - Churn-Aware Recommendation Planning under Aggregated Preference Feedback [6.261444979025644]
We study a sequential decision-making problem motivated by recent regulatory and technological shifts. We introduce the Rec-APC model, in which an anonymous user is drawn from a known prior over latent user types. We prove that optimal policies converge to pure exploitation in finite time and propose a branch-and-bound algorithm to efficiently compute them.
arXiv Detail & Related papers (2025-07-06T19:22:47Z) - What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation. It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals. It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z) - Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User [117.82681846559909]
Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations.
We propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs.
arXiv Detail & Related papers (2025-04-29T06:37:30Z) - Empowering Retrieval-based Conversational Recommendation with Contrasting User Preferences [12.249992789091415]
We propose a novel conversational recommender model, called COntrasting user pReference expAnsion and Learning (CORAL).
CORAL extracts the user's hidden preferences through contrasting preference expansion.
It explicitly differentiates the contrasting preferences and leverages them into the recommendation process via preference-aware learning.
arXiv Detail & Related papers (2025-03-27T21:45:49Z) - Preference Discerning with LLM-Enhanced Generative Retrieval [28.309905847867178]
We propose a new paradigm, which we term preference discerning. In preference discerning, we explicitly condition a generative sequential recommendation system on user preferences within its context. We generate user preferences using Large Language Models (LLMs) based on user reviews and item-specific data.
arXiv Detail & Related papers (2024-12-11T18:26:55Z) - Harm Mitigation in Recommender Systems under User Preference Dynamics [16.213153879446796]
We consider a recommender system that takes into account the interplay between recommendations, user interests, and harmful content.
We seek recommendation policies that establish a tradeoff between maximizing click-through rate (CTR) and mitigating harm.
arXiv Detail & Related papers (2024-06-14T09:52:47Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
We propose a new axis based on eliciting preferences jointly over instruction-response pairs. Joint preferences over instruction and response pairs can significantly enhance the alignment of large language models.
arXiv Detail & Related papers (2024-03-31T02:05:40Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Estimating and Penalizing Induced Preference Shifts in Recommender Systems [10.052697877248601]
We argue that system designers should: estimate the shifts a recommender would induce; evaluate whether such shifts would be undesirable; and even actively optimize to avoid problematic shifts.
We do this by using historical user interaction data to train a predictive user model that implicitly captures their preference dynamics.
In simulated experiments, we show that our learned preference dynamics model is effective in estimating user preferences and how they would respond to new recommenders.
arXiv Detail & Related papers (2022-04-25T21:04:46Z) - Reward Constrained Interactive Recommendation with Natural Language Feedback [158.8095688415973]
We propose a novel constraint-augmented reinforcement learning (RL) framework to efficiently incorporate user preferences over time.
Specifically, we leverage a discriminator to detect recommendations violating user historical preference.
Our proposed framework is general and is further extended to the task of constrained text generation.
arXiv Detail & Related papers (2020-05-04T16:23:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.