Efficient Explorative Key-term Selection Strategies for Conversational
Contextual Bandits
- URL: http://arxiv.org/abs/2303.00315v2
- Date: Sun, 1 Oct 2023 08:13:39 GMT
- Title: Efficient Explorative Key-term Selection Strategies for Conversational
Contextual Bandits
- Authors: Zhiyong Wang, Xutong Liu, Shuai Li, John C.S. Lui
- Abstract summary: We propose ConLinUCB", a general framework for conversational bandits with better information incorporation.
We also design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR.
Experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54% improvement) and computational efficiency (up to 72% improvement)
- Score: 46.49854998602084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational contextual bandits elicit user preferences by occasionally
querying for explicit feedback on key-terms to accelerate learning. However,
there are aspects of existing approaches which limit their performance. First,
information gained from key-term-level conversations and arm-level
recommendations is not appropriately incorporated to speed up learning. Second,
it is important to ask explorative key-terms to quickly elicit the user's
potential interests in various domains to accelerate the convergence of user
preference estimation, which has never been considered in existing works. To
tackle these issues, we first propose ``ConLinUCB", a general framework for
conversational bandits with better information incorporation, combining
arm-level and key-term-level feedback to estimate user preference in one step
at each time. Based on this framework, we further design two bandit algorithms
with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR.
We prove tighter regret upper bounds of our proposed algorithms. Particularly,
ConLinUCB-BS achieves a regret bound of $O(d\sqrt{T\log T})$, better than the
previous result $O(d\sqrt{T}\log T)$. Extensive experiments on synthetic and
real-world data show significant advantages of our algorithms in learning
accuracy (up to 54\% improvement) and computational efficiency (up to 72\%
improvement), compared to the classic ConUCB algorithm, showing the potential
benefit to recommender systems.
Related papers
- Interactive Visualization Recommendation with Hier-SUCB [52.11209329270573]
We propose an interactive personalized visualization recommendation (PVisRec) system that learns on user feedback from previous interactions.
For more interactive and accurate recommendations, we propose Hier-SUCB, a contextual semi-bandit in the PVisRec setting.
arXiv Detail & Related papers (2025-02-05T17:14:45Z) - Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback.
We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions.
arXiv Detail & Related papers (2025-02-04T07:55:41Z) - Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts [27.62165569135504]
A line of research, known as online clustering of bandits, extends contextual MAB by grouping similar users into clusters.
Existing algorithms, which rely on the upper confidence bound (UCB) strategy, struggle to gather adequate statistical information to accurately identify unknown user clusters.
We propose two novel algorithms, UniCLUB and PhaseUniCLUB, which incorporate enhanced exploration mechanisms to accelerate cluster identification.
arXiv Detail & Related papers (2025-01-01T16:38:29Z) - Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning [70.22819290458581]
Reinforcement learning with human feedback (RLHF) is a widely adopted approach in current large language model pipelines.
Our approach introduces two key innovations: (1) on-policy query to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries.
arXiv Detail & Related papers (2024-07-02T10:09:19Z) - Expert with Clustering: Hierarchical Online Preference Learning Framework [4.05836962263239]
Expert with Clustering (EWC) is a hierarchical contextual bandit framework that integrates clustering techniques and prediction with expert advice.
EWC can substantially reduce regret by 27.57% compared to the LinUCB baseline.
arXiv Detail & Related papers (2024-01-26T18:44:49Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Hierarchical Conversational Preference Elicitation with Bandit Feedback [36.507341041113825]
We formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round.
We conduct a survey and analyze a real-world dataset to find that, unlike assumptions made in prior works, key-term rewards are mainly affected by rewards of representative items.
We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items.
arXiv Detail & Related papers (2022-09-06T05:35:24Z) - Large-Scale Sequential Learning for Recommender and Engineering Systems [91.3755431537592]
In this thesis, we focus on the design of an automatic algorithms that provide personalized ranking by adapting to the current conditions.
For the former, we propose novel algorithm called SAROS that take into account both kinds of feedback for learning over the sequence of interactions.
The proposed idea of taking into account the neighbour lines shows statistically significant results in comparison with the initial approach for faults detection in power grid.
arXiv Detail & Related papers (2022-05-13T21:09:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.