Leveraging the Power of Conversations: Optimal Key Term Selection in Conversational Contextual Bandits
- URL: http://arxiv.org/abs/2505.21393v1
- Date: Tue, 27 May 2025 16:22:32 GMT
- Title: Leveraging the Power of Conversations: Optimal Key Term Selection in Conversational Contextual Bandits
- Authors: Maoli Liu, Zhuohua Li, Xiangxiang Dai, John C. S. Lui
- Abstract summary: Conversational recommender systems proactively query users with relevant "key terms" and leverage the feedback to elicit users' preferences for personalized recommendations. Existing algorithms employ key term selection strategies with insufficient exploration, often failing to thoroughly probe users' preferences and resulting in suboptimal preference estimation. We propose three novel algorithms: CLiSK, CLiME, and CLiSK-ME. We theoretically prove that all three algorithms achieve a tighter regret upper bound of $O(\sqrt{dT\log T})$ with respect to the time horizon $T$, improving upon existing methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational recommender systems proactively query users with relevant "key terms" and leverage the feedback to elicit users' preferences for personalized recommendations. Conversational contextual bandits, a prevalent approach in this domain, aim to optimize preference learning by balancing exploitation and exploration. However, several limitations hinder their effectiveness in real-world scenarios. First, existing algorithms employ key term selection strategies with insufficient exploration, often failing to thoroughly probe users' preferences and resulting in suboptimal preference estimation. Second, current algorithms typically rely on deterministic rules to initiate conversations, causing unnecessary interactions when preferences are well-understood and missed opportunities when preferences are uncertain. To address these limitations, we propose three novel algorithms: CLiSK, CLiME, and CLiSK-ME. CLiSK introduces smoothed key term contexts to enhance exploration in preference learning, CLiME adaptively initiates conversations based on preference uncertainty, and CLiSK-ME integrates both techniques. We theoretically prove that all three algorithms achieve a tighter regret upper bound of $O(\sqrt{dT\log{T}})$ with respect to the time horizon $T$, improving upon existing methods. Additionally, we provide a matching lower bound $\Omega(\sqrt{dT})$ for conversational bandits, demonstrating that our algorithms are nearly minimax optimal. Extensive evaluations on both synthetic and real-world datasets show that our approaches achieve at least a 14.6% improvement in cumulative regret.
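The abstract describes the two mechanisms only at a high level. As a rough illustration, the sketch below implements a LinUCB-style conversational bandit that initiates a key-term query adaptively whenever the preference estimate is uncertain, in the spirit of CLiME; the trigger rule, the threshold `tau`, the `ask_user` callback, and all constants are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

class ConversationalLinUCB:
    """Sketch: LinUCB with uncertainty-triggered key-term queries (assumed design)."""

    def __init__(self, d, alpha=1.0, lam=1.0, tau=0.5):
        self.alpha = alpha        # exploration weight for arm selection
        self.tau = tau            # assumed uncertainty threshold for conversing
        self.V = lam * np.eye(d)  # ridge-regularized Gram matrix
        self.b = np.zeros(d)      # accumulated context-weighted rewards

    def _width(self, x):
        # confidence width ||x||_{V^{-1}}, the usual LinUCB bonus term
        return float(np.sqrt(x @ np.linalg.solve(self.V, x)))

    def maybe_converse(self, key_terms, ask_user):
        # Adaptive trigger: query the most uncertain key term, but only
        # while its width exceeds tau (CLiME-flavored; details assumed).
        j = int(np.argmax([self._width(k) for k in key_terms]))
        if self._width(key_terms[j]) > self.tau:
            feedback = ask_user(j)  # user's key-term feedback, e.g. in [-1, 1]
            self.V += np.outer(key_terms[j], key_terms[j])
            self.b += feedback * key_terms[j]

    def select_arm(self, arms):
        theta = np.linalg.solve(self.V, self.b)  # ridge estimate of preferences
        ucb = [x @ theta + self.alpha * self._width(x) for x in arms]
        return int(np.argmax(ucb))

    def update(self, x, reward):
        self.V += np.outer(x, x)
        self.b += reward * x
```

A driver loop would call `maybe_converse` before `select_arm` each round, so conversations occur only while the confidence width over key terms stays above `tau` and stop once preferences are well estimated.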
Related papers
- Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback
Stradi et al. (2024) proposed the first best-of-both-worlds algorithm for constrained Markov decision processes. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
arXiv Detail & Related papers (2024-10-03T07:44:40Z) - Beyond Primal-Dual Methods in Bandits with Stochastic and Adversarial Constraints
We address a generalization of the bandit with knapsacks problem, where a learner aims to maximize rewards while satisfying an arbitrary set of long-term constraints.
Our goal is to design best-of-both-worlds algorithms that perform well under both stochastic and adversarial constraints.
arXiv Detail & Related papers (2024-05-25T08:09:36Z) - Stop Relying on No-Choice and Do not Repeat the Moves: Optimal, Efficient and Practical Algorithms for Assortment Optimization
We develop efficient algorithms for the problem of regret minimization in assortment selection with Plackett-Luce (PL) based user choices.
Our methods are practical, provably optimal, and devoid of the aforementioned limitations of the existing methods.
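For context, the Plackett-Luce model named in the title reduces, for a single choice from an offered assortment, to weight-proportional sampling. The snippet below shows that standard PL choice step; it is the textbook model, not the paper's algorithm:

```python
import numpy as np

def plackett_luce_choice(assortment, weights, rng):
    """One Plackett-Luce choice: item i in the offered set is picked with
    probability w_i / sum of weights over the assortment (standard PL model)."""
    w = np.asarray([weights[i] for i in assortment], dtype=float)
    idx = rng.choice(len(assortment), p=w / w.sum())
    return assortment[idx]

# Example: user picks among items 0, 3, 5 with PL weights 1.0, 2.0, 0.5.
rng = np.random.default_rng(0)
picked = plackett_luce_choice([0, 3, 5], {0: 1.0, 3: 2.0, 5: 0.5}, rng)
```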
arXiv Detail & Related papers (2024-02-29T07:17:04Z) - Expert with Clustering: Hierarchical Online Preference Learning Framework
Expert with Clustering (EWC) is a hierarchical contextual bandit framework that integrates clustering techniques and prediction with expert advice.
EWC can substantially reduce regret by 27.57% compared to the LinUCB baseline.
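As a loose sketch of "prediction with expert advice" over offline clusters, one can run a Hedge-style update in which each cluster centroid acts as one expert; the `cluster_loss` interface and learning rate `eta` below are illustrative assumptions, not EWC's actual design:

```python
import numpy as np

def hedge_over_clusters(n_clusters, cluster_loss, T, eta=0.5):
    """Hedge over cluster 'experts'. cluster_loss(t) returns an array with
    each cluster's loss at round t (interface and eta are assumptions)."""
    w = np.ones(n_clusters)
    picks = []
    for t in range(T):
        picks.append(int(np.argmax(w)))     # follow the currently leading cluster
        losses = np.asarray(cluster_loss(t))
        w *= np.exp(-eta * losses)          # multiplicative-weights (Hedge) update
    return picks, w / w.sum()
```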
arXiv Detail & Related papers (2024-01-26T18:44:49Z) - Combinatorial Stochastic-Greedy Bandit
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
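A minimal sketch of an explore-then-commit scheme with a stochastic-greedy commit phase follows; it simplifies by allowing singleton pulls during exploration, and the split point `m`, candidate-pool size, and `pull` interface are assumptions rather than SGB's exact design:

```python
import numpy as np

def sgb_sketch(pull, n, k, T, m, rng=None):
    """Explore-then-commit with stochastic-greedy set construction (illustrative).
    pull(S) returns the observed joint reward of the arm set S."""
    rng = rng or np.random.default_rng(0)
    sums, counts = np.zeros(n), np.zeros(n)
    # Exploration phase (simplified to singleton pulls to estimate arm means).
    for t in range(m):
        i = t % n
        sums[i] += pull({i})
        counts[i] += 1
    mu_hat = sums / np.maximum(counts, 1)
    # Stochastic-greedy commit: grow the set by the best arm from a small
    # random candidate pool instead of scanning all n base arms each step.
    S = set()
    while len(S) < k:
        pool = np.array([i for i in range(n) if i not in S])
        cand = rng.choice(pool, size=min(len(pool), max(1, n // k)), replace=False)
        S.add(int(cand[np.argmax(mu_hat[cand])]))
    # Commit phase: play the chosen set for the remaining rounds.
    total_reward = sum(pull(S) for _ in range(T - m))
    return S, total_reward
```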
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Rate-Optimal Policy Optimization for Linear Markov Decision Processes
We obtain rate-optimal $\widetilde{O}(\sqrt{K})$ regret where $K$ denotes the number of episodes.
Our work is the first to establish the optimal (w.r.t. $K$) rate of convergence in the stochastic setting with bandit feedback; in the adversarial setting with full-information feedback, no algorithm with an optimal rate guarantee was previously known.
arXiv Detail & Related papers (2023-08-28T15:16:09Z) - Vague Preference Policy Learning for Conversational Recommendation
Conversational recommendation systems commonly assume users have clear preferences, leading to potential over-filtering. We introduce the Vague Preference Multi-round Conversational Recommendation (VPMCR) scenario, employing a soft estimation mechanism to accommodate users' vague and dynamic preferences. Our work advances CRS by accommodating users' inherent ambiguity and relative decision-making processes, improving real-world applicability.
arXiv Detail & Related papers (2023-06-07T14:57:21Z) - Efficient Explorative Key-term Selection Strategies for Conversational Contextual Bandits
We propose ConLinUCB, a general framework for conversational bandits with better information incorporation.
We also design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR.
Experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54% improvement) and computational efficiency (up to 72% improvement).
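The "max confidence radius" flavor of explorative key-term selection can be sketched as picking the key term with the largest width $\|k\|_{V^{-1}}$ under the current Gram matrix $V$; the helper below illustrates that rule, though ConLinUCB-MCR's exact criterion may differ:

```python
import numpy as np

def select_key_term_mcr(key_terms, V):
    """Pick the key term with maximal confidence radius ||k||_{V^{-1}},
    i.e., the direction the current preference estimate is least certain
    about (MCR-style rule; the paper's exact criterion may differ)."""
    widths = [float(np.sqrt(k @ np.linalg.solve(V, k))) for k in key_terms]
    return int(np.argmax(widths))
```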
arXiv Detail & Related papers (2023-03-01T08:24:54Z) - Personalized Algorithmic Recourse with Preference Elicitation
We introduce PEAR, the first human-in-the-loop approach capable of providing personalized algorithmic recourse tailored to the needs of any end-user.
PEAR builds on insights from Bayesian Preference Elicitation to iteratively refine an estimate of the costs of actions by asking choice set queries to the target user.
Our empirical evaluation on real-world datasets highlights how PEAR produces high-quality personalized recourse in only a handful of iterations.
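A hedged sketch of choice-set preference elicitation: maintain a particle approximation of the user's cost weights and reweight each particle by how well it explains every observed choice. The logit likelihood, `ask` callback, and noise scale are illustrative assumptions, not PEAR's actual model:

```python
import numpy as np

def elicit_cost_weights(particles, choice_sets, ask, noise=0.1):
    """Particle sketch of Bayesian preference elicitation via choice sets.
    particles: array (P, d) of candidate cost-weight vectors;
    ask(options) returns the index the user picks from `options`, a list
    of d-dim feature vectors. Likelihood and noise are assumptions."""
    particles = np.asarray(particles, dtype=float)
    weights = np.ones(len(particles))
    for options in choice_sets:
        chosen = ask(options)
        for i, w in enumerate(particles):
            costs = np.array([w @ o for o in options])
            p = np.exp(-costs / noise)        # logit: lower cost, likelier pick
            weights[i] *= p[chosen] / p.sum()
    weights /= weights.sum()
    return weights @ particles                # posterior-mean cost weights
```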
arXiv Detail & Related papers (2022-05-27T03:12:18Z) - Bayesian decision-making under misspecified priors with applications to meta-learning
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
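For reference, Thompson sampling for a linear-Gaussian bandit is a short loop: sample a parameter from the current posterior, act greedily on it, then do a conjugate update. A misspecified prior simply means `mu` and `Sigma` start wrong; the sketch below (our own, assuming a known noise variance) shows the mechanics:

```python
import numpy as np

def thompson_step(arms, mu, Sigma, rng):
    """Sample theta ~ N(mu, Sigma) and play the arm it rates highest."""
    theta = rng.multivariate_normal(mu, Sigma)
    return int(np.argmax([x @ theta for x in arms]))

def gaussian_posterior_update(mu, Sigma, x, reward, noise_var=1.0):
    """Conjugate update after observing `reward` for feature vector x,
    assuming rewards are linear in theta with known Gaussian noise."""
    prec = np.linalg.inv(Sigma) + np.outer(x, x) / noise_var
    Sigma_new = np.linalg.inv(prec)
    mu_new = Sigma_new @ (np.linalg.inv(Sigma) @ mu + reward * x / noise_var)
    return mu_new, Sigma_new
```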
arXiv Detail & Related papers (2021-07-03T23:17:26Z) - Optimistic Policy Optimization with Bandit Feedback
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2AH^4K})$ regret for stochastic rewards.
To the best of our knowledge, these are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)