Bandits with Preference Feedback: A Stackelberg Game Perspective
- URL: http://arxiv.org/abs/2406.16745v2
- Date: Wed, 30 Oct 2024 17:10:52 GMT
- Title: Bandits with Preference Feedback: A Stackelberg Game Perspective
- Authors: Barna Pásztor, Parnian Kassraie, Andreas Krause
- Abstract summary: Bandits with preference feedback present a powerful tool for optimizing unknown target functions.
We propose MAXMINLCB, which emulates a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards.
- Score: 41.928798759636216
- Abstract: Bandits with preference feedback present a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed instead of direct value queries. This model allows for incorporating human feedback into online inference and optimization and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over finite small domains that limit practical interest. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid rate-optimal regret guarantee. This is due to our novel preference-based confidence sequences for kernelized logistic estimators.
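The Stackelberg selection rule described in the abstract can be made concrete on a finite candidate set. Below is a minimal, hypothetical sketch, assuming precomputed lower-confidence-bound estimates `lcb[i, j]` on the probability that action `i` is preferred over action `j` (the paper's kernelized logistic confidence sequences are not reproduced here): the leader picks the action maximizing its worst-case LCB, and the follower best-responds by minimizing it.

```python
import numpy as np

def maxmin_lcb_pair(lcb: np.ndarray) -> tuple[int, int]:
    """Select an action pair via a zero-sum Stackelberg game on LCB estimates.

    lcb[i, j] is a lower confidence bound on P(action i preferred over j).
    Hypothetical sketch over a finite candidate set, not the authors'
    implementation.
    """
    followers = lcb.argmin(axis=1)                    # follower's best response to each leader
    worst_case = lcb[np.arange(len(lcb)), followers]  # leader's value under that response
    leader = int(worst_case.argmax())                 # leader maximizes its worst case
    return leader, int(followers[leader])

# Toy usage: 4 candidate actions with made-up LCB values.
rng = np.random.default_rng(0)
lcb = rng.uniform(0.2, 0.8, size=(4, 4))
np.fill_diagonal(lcb, 0.5)  # an action ties against itself
print(maxmin_lcb_pair(lcb))
```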
Related papers
- Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity [59.75300530380427]
We consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm only has access to noisy evaluations of the objective function it queries.
We provide the first tight characterization of the minimax simple regret rate by developing matching upper and lower bounds.
arXiv Detail & Related papers (2024-06-28T02:56:22Z)
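For the zeroth-order setting above, a standard two-point (random-direction) gradient estimator illustrates how gradient information is recovered from noisy function evaluations. This is a textbook construction, not the paper's minimax-optimal procedure; `f`, the step size `h`, and the unit-sphere direction are illustrative choices.

```python
import numpy as np

def two_point_gradient(f, x: np.ndarray, h: float = 1e-2, rng=None) -> np.ndarray:
    """Estimate the gradient of f at x from two noisy evaluations.

    Generic two-point estimator along a random unit direction, scaled by the
    dimension so it is (approximately) unbiased for the smoothed objective.
    Illustration only; the paper's procedure is more refined.
    """
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return (f(x + h * u) - f(x - h * u)) / (2 * h) * u * x.size

# Toy usage on a noisy quadratic.
rng = np.random.default_rng(1)
noisy_f = lambda z: float(z @ z) + rng.normal(scale=1e-3)
x = np.ones(5)
print(two_point_gradient(noisy_f, x, rng=rng))
```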
- Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO).
Our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
arXiv Detail & Related papers (2024-06-16T09:06:17Z)
- $\epsilon$-Optimally Solving Zero-Sum POSGs [4.424170214926035]
A recent method for solving zero-sum partially observable games embeds the original game into a new one called the occupancy Markov game.
This paper exploits novel uniform continuity properties of the optimal value function to overcome a limitation of that approach.
arXiv Detail & Related papers (2024-05-29T08:34:01Z)
- PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits [29.097522376094624]
We propose a simple and computationally efficient sparse linear estimation method called PopArt.
We derive sparse linear bandit algorithms whose regret upper bounds improve upon the state of the art.
arXiv Detail & Related papers (2022-10-25T19:13:20Z)
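PopArt's estimator is not spelled out in the summary above, so as a generic point of reference the sketch below recovers a sparse linear model with iterative soft-thresholding (ISTA for the Lasso), a standard baseline rather than the paper's method:

```python
import numpy as np

def ista_lasso(X, y, lam=0.1, n_iters=500):
    """Sparse linear regression via iterative soft-thresholding (ISTA).

    Standard Lasso baseline for illustration; PopArt is a different,
    more sample-efficient estimator.
    """
    n, d = X.shape
    step = n / np.linalg.norm(X, ord=2) ** 2  # 1 / Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return theta

# Toy usage: 2-sparse ground truth in 20 dimensions; recovered support printed.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
theta_true = np.zeros(20); theta_true[[3, 7]] = [1.5, -2.0]
y = X @ theta_true + 0.05 * rng.standard_normal(100)
print(np.nonzero(np.round(ista_lasso(X, y), 2))[0])
```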
- Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To select an input query, we only need to train an NN and then maximize the trained NN.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z)
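The "train an NN, then maximize it" recipe above admits a compact sketch. The version below perturbs the training targets with Gaussian noise before fitting, so that the argmax of the fitted network behaves like a Thompson sample; the network size, noise scale, and candidate grid are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def sample_then_optimize_query(X_obs, y_obs, candidates, noise=0.1, rng=None):
    """Pick the next query via a sample-then-optimize Thompson step.

    Perturb observed targets, fit a small NN, and return the candidate
    maximizing the fitted network. Hedged sketch only; the paper's
    algorithms use specific perturbations and NTK-based guarantees.
    """
    rng = rng or np.random.default_rng()
    y_tilde = y_obs + noise * rng.standard_normal(y_obs.shape)  # perturbed targets
    net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
    net.fit(X_obs, y_tilde)
    scores = net.predict(candidates)
    return candidates[int(scores.argmax())]

# Toy usage on a 1-D black-box function.
rng = np.random.default_rng(3)
X_obs = rng.uniform(-2, 2, size=(30, 1))
y_obs = np.sin(3 * X_obs[:, 0]) + 0.1 * rng.standard_normal(30)
candidates = np.linspace(-2, 2, 201).reshape(-1, 1)
print(sample_then_optimize_query(X_obs, y_obs, candidates, rng=rng))
```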
- Efficient Exploration in Binary and Preferential Bayesian Optimization [0.5076419064097732]
We show that it is important for BO algorithms to distinguish between different types of uncertainty.
We propose several new acquisition functions that outperform the state of the art.
arXiv Detail & Related papers (2021-10-18T14:44:34Z)
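The distinction between types of uncertainty mentioned above can be illustrated with a Beta-Bernoulli toy model: aleatoric uncertainty is the irreducible Bernoulli noise E[p(1-p)], while epistemic uncertainty is the posterior variance of p and shrinks with data. This is a minimal illustration, not the paper's GP-based acquisition functions.

```python
import numpy as np

def uncertainty_split(successes: int, trials: int, a0=1.0, b0=1.0):
    """Split predictive uncertainty for Bernoulli feedback under a Beta prior.

    Epistemic: posterior variance of p (reducible with more data).
    Aleatoric: expected Bernoulli noise E[p(1-p)] (irreducible).
    Toy illustration; the paper works with GP classification instead.
    """
    a, b = a0 + successes, b0 + trials - successes
    mean = a / (a + b)
    epistemic = a * b / ((a + b) ** 2 * (a + b + 1))  # Var[p]
    aleatoric = mean * (1 - mean) - epistemic         # E[p(1-p)] = E[p] - E[p^2]
    return epistemic, aleatoric

print(uncertainty_split(3, 5))     # few trials: epistemic uncertainty is large
print(uncertainty_split(60, 100))  # many trials: epistemic uncertainty shrinks
```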
- Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (thus having no limit cycles) and to find an appropriate stationary point in a polynomial number of iterations.
arXiv Detail & Related papers (2021-06-02T22:03:36Z)
- Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z)
- Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z)
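The lenient regret criterion above has a direct translation to code: suboptimality gaps smaller than $\epsilon$ contribute nothing. A minimal sketch of this instantiation (the paper's full definition may be more general):

```python
import numpy as np

def lenient_regret(mu: np.ndarray, actions: np.ndarray, eps: float) -> float:
    """Cumulative lenient regret: suboptimality gaps below eps are ignored.

    mu: true mean reward of each arm; actions: arm index chosen at each round.
    """
    gaps = mu.max() - mu[actions]
    return float(np.where(gaps >= eps, gaps, 0.0).sum())

# Toy usage: gaps of 0.05 are forgiven when eps = 0.1.
mu = np.array([1.0, 0.95, 0.5])
actions = np.array([0, 1, 1, 2])
print(lenient_regret(mu, actions, eps=0.1))  # only the 0.5 gap counts -> 0.5
```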
- A Convergent and Dimension-Independent Min-Max Optimization Algorithm [32.492526162436405]
We study a min-max framework in which the min-player updates its parameters using a proposal distribution, for a smooth and bounded nonconvex-nonconcave objective.
Our algorithm converges to an approximate local equilibrium in a number of iterations that does not depend on the dimension.
arXiv Detail & Related papers (2020-06-22T16:11:30Z)