Related papers: Efficient Explorative Key-term Selection Strategies for Conversational Contextual Bandits

Efficient Explorative Key-term Selection Strategies for Conversational Contextual Bandits

URL: http://arxiv.org/abs/2303.00315v2
Date: Sun, 1 Oct 2023 08:13:39 GMT
Title: Efficient Explorative Key-term Selection Strategies for Conversational Contextual Bandits
Authors: Zhiyong Wang, Xutong Liu, Shuai Li, John C.S. Lui
Abstract summary: We propose ConLinUCB", a general framework for conversational bandits with better information incorporation. We also design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR. Experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54% improvement) and computational efficiency (up to 72% improvement)
Score: 46.49854998602084
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conversational contextual bandits elicit user preferences by occasionally querying for explicit feedback on key-terms to accelerate learning. However, there are aspects of existing approaches which limit their performance. First, information gained from key-term-level conversations and arm-level recommendations is not appropriately incorporated to speed up learning. Second, it is important to ask explorative key-terms to quickly elicit the user's potential interests in various domains to accelerate the convergence of user preference estimation, which has never been considered in existing works. To tackle these issues, we first propose ``ConLinUCB", a general framework for conversational bandits with better information incorporation, combining arm-level and key-term-level feedback to estimate user preference in one step at each time. Based on this framework, we further design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR. We prove tighter regret upper bounds of our proposed algorithms. Particularly, ConLinUCB-BS achieves a regret bound of $O(d\sqrt{T\log T})$, better than the previous result $O(d\sqrt{T}\log T)$. Extensive experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54\% improvement) and computational efficiency (up to 72\% improvement), compared to the classic ConUCB algorithm, showing the potential benefit to recommender systems.

Related papers

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback [61.49239204705301]
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model.<n>Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems.
arXiv Detail & Related papers (2025-10-20T02:28:08Z)
Multi-User Contextual Cascading Bandits for Personalized Recommendation [2.506355754272765]
Multi-User Contextual Cascading Bandit model captures realistic online advertising scenarios where multiple users interact with sequentially displayed items simultaneously.<n>We first propose Upper Confidence Bound with Backward Planning (UCBBP), a UCB-style algorithm tailored to this setting, and prove that it achieves a regret bound of $widetildeO(sqrtTHN)$ over $T$ episodes, $H$ session steps, and $N$ contexts per episode.<n>Motivated by the fact that many users interact with the system simultaneously, we introduce a second algorithm, termed Active Upper Confidence Bound with
arXiv Detail & Related papers (2025-08-19T16:14:33Z)
Leveraging the Power of Conversations: Optimal Key Term Selection in Conversational Contextual Bandits [27.62165569135504]
Conversational recommender systems proactively query users with relevant "key terms" and leverage the feedback to elicit users' preferences for personalized recommendations.<n>Existing algorithms employ key term selection strategies with insufficient exploration, often failing to thoroughly probe users' preferences and resulting in suboptimal preference estimation.<n>We propose three novel algorithms: CLiSK, CLiME, and CLiSK-ME.<n>We theoretically prove that all three algorithms achieve a tighter regret upper bound of $O(sqrtdTlogT)$ with respect to the time horizon $T$, improving upon existing methods.
arXiv Detail & Related papers (2025-05-27T16:22:32Z)
Interactive Visualization Recommendation with Hier-SUCB [52.11209329270573]
We propose an interactive personalized visualization recommendation (PVisRec) system that learns on user feedback from previous interactions. For more interactive and accurate recommendations, we propose Hier-SUCB, a contextual semi-bandit in the PVisRec setting.
arXiv Detail & Related papers (2025-02-05T17:14:45Z)
Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions.
arXiv Detail & Related papers (2025-02-04T07:55:41Z)
Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts [27.62165569135504]
A line of research, known as online clustering of bandits, extends contextual MAB by grouping similar users into clusters. Existing algorithms, which rely on the upper confidence bound (UCB) strategy, struggle to gather adequate statistical information to accurately identify unknown user clusters. We propose two novel algorithms, UniCLUB and PhaseUniCLUB, which incorporate enhanced exploration mechanisms to accelerate cluster identification.
arXiv Detail & Related papers (2025-01-01T16:38:29Z)
Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning [70.22819290458581]
Reinforcement learning with human feedback (RLHF) is a widely adopted approach in current large language model pipelines. Our approach introduces two key innovations: (1) on-policy query to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries.
arXiv Detail & Related papers (2024-07-02T10:09:19Z)
Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic Segmentation [7.005068872406135]
Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems. These approaches frequently involve complex training pipelines and a substantial computational burden. We propose a PrevMatch framework that effectively mitigates the limitations by maximizing the utilization of the temporal knowledge obtained during the training process.
arXiv Detail & Related papers (2024-05-31T03:54:59Z)
Reinforced In-Context Black-Box Optimization [64.25546325063272]
RIBBO is a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks. Central to our method is to augment the optimization histories with textitregret-to-go tokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories.
arXiv Detail & Related papers (2024-02-27T11:32:14Z)
Expert with Clustering: Hierarchical Online Preference Learning Framework [4.05836962263239]
Expert with Clustering (EWC) is a hierarchical contextual bandit framework that integrates clustering techniques and prediction with expert advice. EWC can substantially reduce regret by 27.57% compared to the LinUCB baseline.
arXiv Detail & Related papers (2024-01-26T18:44:49Z)
Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
Hierarchical Conversational Preference Elicitation with Bandit Feedback [36.507341041113825]
We formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round. We conduct a survey and analyze a real-world dataset to find that, unlike assumptions made in prior works, key-term rewards are mainly affected by rewards of representative items. We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items.
arXiv Detail & Related papers (2022-09-06T05:35:24Z)
Meta-Wrapper: Differentiable Wrapping Operator for User Interest Selection in CTR Prediction [97.99938802797377]
Click-through rate (CTR) prediction, whose goal is to predict the probability of the user to click on an item, has become increasingly significant in recommender systems. Recent deep learning models with the ability to automatically extract the user interest from his/her behaviors have achieved great success. We propose a novel approach under the framework of the wrapper method, which is named Meta-Wrapper.
arXiv Detail & Related papers (2022-06-28T03:28:15Z)
Large-Scale Sequential Learning for Recommender and Engineering Systems [91.3755431537592]
In this thesis, we focus on the design of an automatic algorithms that provide personalized ranking by adapting to the current conditions. For the former, we propose novel algorithm called SAROS that take into account both kinds of feedback for learning over the sequence of interactions. The proposed idea of taking into account the neighbour lines shows statistically significant results in comparison with the initial approach for faults detection in power grid.
arXiv Detail & Related papers (2022-05-13T21:09:41Z)
An Improved Reinforcement Learning Algorithm for Learning to Branch [12.27934038849211]
Branch-and-bound (B&B) is a general and widely used method for optimization. In this paper, we propose a novel reinforcement learning-based B&B algorithm. We evaluate the performance of the proposed algorithm over three public research benchmarks.
arXiv Detail & Related papers (2022-01-17T04:50:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.