Kernelized Offline Contextual Dueling Bandits
- URL: http://arxiv.org/abs/2307.11288v1
- Date: Fri, 21 Jul 2023 01:17:31 GMT
- Title: Kernelized Offline Contextual Dueling Bandits
- Authors: Viraj Mehta and Ojash Neopane and Vikramjeet Das and Sen Lin and Jeff
Schneider and Willie Neiswanger
- Abstract summary: In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback.
We give an upper-confidence-bound style algorithm for this setting and prove a regret bound.
- Score: 15.646879026749168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based feedback is important for many applications where direct
evaluation of a reward function is not feasible. A notable recent example
arises in reinforcement learning from human feedback on large language models.
For many of these applications, the cost of acquiring the human feedback can be
substantial or even prohibitive. In this work, we take advantage of the fact
that often the agent can choose contexts at which to obtain human feedback in
order to most efficiently identify a good policy, and introduce the offline
contextual dueling bandit setting. We give an upper-confidence-bound style
algorithm for this setting and prove a regret bound. We also give empirical
confirmation that this method outperforms a similar strategy that uses
uniformly sampled contexts.
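The abstract describes an upper-confidence-bound style algorithm that actively chooses the contexts at which to query human preference feedback. The paper's actual algorithm is not reproduced here; as a minimal illustrative sketch (all class names, kernel choices, and hyperparameters below are assumptions, not the authors' method), one can imagine a kernelized learner that maintains a Gaussian-process-style posterior over duel outcomes and queries the context/action pair whose predicted preference is most uncertain:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=0.5):
    """Squared-exponential kernel between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

class KernelUCBDueling:
    """Illustrative kernelized UCB-style learner for dueling feedback.

    Each candidate duel is represented by a feature difference
    z = phi(x, a) - phi(x, b). The learner keeps a GP-style posterior
    over duel outcomes and selects the duel with the widest
    confidence interval, i.e. where human feedback is most informative.
    This is a sketch under simplifying assumptions, not the paper's
    algorithm.
    """

    def __init__(self, beta=2.0, noise=0.1):
        self.beta, self.noise = beta, noise
        self.Z, self.y = [], []  # observed duels and +/-1 outcomes

    def _posterior(self, Zq):
        """Posterior mean and std at query points Zq (rows)."""
        Zq = np.atleast_2d(Zq)
        if not self.Z:  # no data yet: prior mean 0, prior std 1
            return np.zeros(len(Zq)), np.ones(len(Zq))
        Z = np.vstack(self.Z)
        K = rbf_kernel(Z, Z) + self.noise * np.eye(len(Z))
        k = rbf_kernel(Zq, Z)
        Kinv = np.linalg.inv(K)
        mean = k @ Kinv @ np.array(self.y)
        # predictive variance: k(z, z) - k^T K^{-1} k, with k(z, z) = 1
        var = 1.0 - np.einsum("ij,jk,ik->i", k, Kinv, k)
        return mean, np.sqrt(np.maximum(var, 1e-12))

    def select(self, candidates):
        """Index of the candidate duel with the largest UCB width."""
        _, std = self._posterior(np.vstack(candidates))
        return int(np.argmax(self.beta * std))

    def update(self, z, outcome):
        """Record the human preference (+1 / -1) for duel z."""
        self.Z.append(z)
        self.y.append(outcome)
```

After each human query the posterior tightens around the queried duel, so subsequent `select` calls move on to still-uncertain regions of the context space, mirroring the information-gathering intuition in the abstract.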
Related papers
- Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- READ: Improving Relation Extraction from an ADversarial Perspective [33.44949503459933]
We propose an adversarial training method specifically designed for relation extraction (RE).
Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations.
arXiv Detail & Related papers (2024-04-02T16:42:44Z)
- RLVF: Learning from Verbal Feedback without Overgeneralization [94.19501420241188]
We study the problem of incorporating verbal feedback without such overgeneralization.
We develop a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO).
Our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts.
arXiv Detail & Related papers (2024-02-16T18:50:24Z)
- Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration [29.935758027209292]
Preference-based feedback is important for many applications in reinforcement learning.
In this work, we take advantage of the fact that one can often choose contexts to obtain human feedback.
We show that our method is able to reach better performance with fewer samples of human preferences than multiple baselines.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users' Feedback [62.997667081978825]
We present a novel approach for considering user feedback and evaluate it using three distinct strategies.
Despite the limited amount of feedback returned by users (as low as 20% of the total), our approach obtains results similar to those of state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-16T07:32:51Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.