Efficient Explorative Key-term Selection Strategies for Conversational
Contextual Bandits
- URL: http://arxiv.org/abs/2303.00315v2
- Date: Sun, 1 Oct 2023 08:13:39 GMT
- Title: Efficient Explorative Key-term Selection Strategies for Conversational
Contextual Bandits
- Authors: Zhiyong Wang, Xutong Liu, Shuai Li, John C.S. Lui
- Abstract summary: We propose ConLinUCB", a general framework for conversational bandits with better information incorporation.
We also design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR.
Experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54% improvement) and computational efficiency (up to 72% improvement)
- Score: 46.49854998602084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational contextual bandits elicit user preferences by occasionally
querying for explicit feedback on key-terms to accelerate learning. However,
there are aspects of existing approaches which limit their performance. First,
information gained from key-term-level conversations and arm-level
recommendations is not appropriately incorporated to speed up learning. Second,
it is important to ask explorative key-terms to quickly elicit the user's
potential interests in various domains to accelerate the convergence of user
preference estimation, which has never been considered in existing works. To
tackle these issues, we first propose ``ConLinUCB", a general framework for
conversational bandits with better information incorporation, combining
arm-level and key-term-level feedback to estimate user preference in one step
at each time. Based on this framework, we further design two bandit algorithms
with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR.
We prove tighter regret upper bounds of our proposed algorithms. Particularly,
ConLinUCB-BS achieves a regret bound of $O(d\sqrt{T\log T})$, better than the
previous result $O(d\sqrt{T}\log T)$. Extensive experiments on synthetic and
real-world data show significant advantages of our algorithms in learning
accuracy (up to 54\% improvement) and computational efficiency (up to 72\%
improvement), compared to the classic ConUCB algorithm, showing the potential
benefit to recommender systems.
Related papers
- Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning [70.22819290458581]
Reinforcement learning with human feedback (RLHF) is a widely adopted approach in current large language model pipelines.
Our approach introduces two key innovations: (1) on-policy query to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries.
arXiv Detail & Related papers (2024-07-02T10:09:19Z) - Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic Segmentation [7.005068872406135]
Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems.
These approaches frequently involve complex training pipelines and a substantial computational burden.
We propose a PrevMatch framework that effectively mitigates the limitations by maximizing the utilization of the temporal knowledge obtained during the training process.
arXiv Detail & Related papers (2024-05-31T03:54:59Z) - Reinforced In-Context Black-Box Optimization [64.25546325063272]
RIBBO is a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion.
RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks.
Central to our method is to augment the optimization histories with textitregret-to-go tokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories.
arXiv Detail & Related papers (2024-02-27T11:32:14Z) - Expert with Clustering: Hierarchical Online Preference Learning Framework [4.05836962263239]
Expert with Clustering (EWC) is a hierarchical contextual bandit framework that integrates clustering techniques and prediction with expert advice.
EWC can substantially reduce regret by 27.57% compared to the LinUCB baseline.
arXiv Detail & Related papers (2024-01-26T18:44:49Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Hierarchical Conversational Preference Elicitation with Bandit Feedback [36.507341041113825]
We formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round.
We conduct a survey and analyze a real-world dataset to find that, unlike assumptions made in prior works, key-term rewards are mainly affected by rewards of representative items.
We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items.
arXiv Detail & Related papers (2022-09-06T05:35:24Z) - Meta-Wrapper: Differentiable Wrapping Operator for User Interest
Selection in CTR Prediction [97.99938802797377]
Click-through rate (CTR) prediction, whose goal is to predict the probability of the user to click on an item, has become increasingly significant in recommender systems.
Recent deep learning models with the ability to automatically extract the user interest from his/her behaviors have achieved great success.
We propose a novel approach under the framework of the wrapper method, which is named Meta-Wrapper.
arXiv Detail & Related papers (2022-06-28T03:28:15Z) - Large-Scale Sequential Learning for Recommender and Engineering Systems [91.3755431537592]
In this thesis, we focus on the design of an automatic algorithms that provide personalized ranking by adapting to the current conditions.
For the former, we propose novel algorithm called SAROS that take into account both kinds of feedback for learning over the sequence of interactions.
The proposed idea of taking into account the neighbour lines shows statistically significant results in comparison with the initial approach for faults detection in power grid.
arXiv Detail & Related papers (2022-05-13T21:09:41Z) - An Improved Reinforcement Learning Algorithm for Learning to Branch [12.27934038849211]
Branch-and-bound (B&B) is a general and widely used method for optimization.
In this paper, we propose a novel reinforcement learning-based B&B algorithm.
We evaluate the performance of the proposed algorithm over three public research benchmarks.
arXiv Detail & Related papers (2022-01-17T04:50:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.