Related papers: Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles

URL: http://arxiv.org/abs/2505.22361v1
Date: Wed, 28 May 2025 13:41:00 GMT
Title: Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles
Authors: Xiangyu Chang, Xi Chen, Yining Wang, Zhiyi Zeng
Abstract summary: We study a bandit optimization problem where the goal is to maximize a function $f(x)$ over $T$ periods. We show that such a pairwise comparison oracle finds important applications to joint pricing and inventory replenishment problems.
Score: 14.070618685107645
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper studies a bandit optimization problem where the goal is to maximize a function $f(x)$ over $T$ periods for some unknown strongly concave function $f$. We consider a new pairwise comparison oracle, where the decision-maker chooses a pair of actions $(x, x')$ for a consecutive number of periods and then obtains an estimate of $f(x)-f(x')$. We show that such a pairwise comparison oracle finds important applications to joint pricing and inventory replenishment problems and network revenue management. The challenge in this bandit optimization is twofold. First, the decision-maker needs to determine not only a pair of actions $(x, x')$ but also a stopping time $n$ (i.e., the number of queries based on $(x, x')$). Second, motivated by our inventory application, the estimate of the difference $f(x)-f(x')$ is biased, which is different from existing oracles in the stochastic optimization literature. To address these challenges, we first introduce a discretization technique and local polynomial approximation to relate this problem to linear bandits. Then we develop a tournament successive elimination technique to localize the discretized cell and run an interactive batched version of the LinUCB algorithm on cells. We establish regret bounds that are optimal up to poly-logarithmic factors. Furthermore, we apply our proposed algorithm and analytical framework to the two operations management problems and obtain results that improve state-of-the-art results in the existing literature.
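To make the oracle and the elimination idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes an unbiased oracle with Gaussian noise (the paper's oracle is biased), a fixed batch size n instead of an adaptive stopping time, a one-dimensional action space, and a plain single-elimination tournament over cell midpoints with no local polynomial approximation or LinUCB step. The names make_oracle and tournament, the test function f, and all numeric values are hypothetical.

```python
import random


def make_oracle(f, noise=0.05, seed=0):
    """Build a batch pairwise comparison oracle for f (illustrative only)."""
    rng = random.Random(seed)

    def oracle(x, x_prime, n):
        # Play the pair (x, x') for n consecutive periods and return the
        # average observed difference, an estimate of f(x) - f(x').
        # Assumption: noise is zero-mean Gaussian, so the estimate is unbiased.
        total = 0.0
        for _ in range(n):
            y = f(x) + rng.gauss(0.0, noise)
            y_prime = f(x_prime) + rng.gauss(0.0, noise)
            total += y - y_prime
        return total / n

    return oracle


def tournament(cells, oracle, n):
    """Single-elimination tournament over candidate actions."""
    survivors = list(cells)
    while len(survivors) > 1:
        next_round = []
        # Compare adjacent survivors; the estimated winner advances.
        for i in range(0, len(survivors) - 1, 2):
            a, b = survivors[i], survivors[i + 1]
            next_round.append(a if oracle(a, b, n) >= 0 else b)
        # With an odd number of survivors, the last one gets a bye.
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0]


def f(x):
    # Hypothetical strongly concave objective with maximizer 0.55.
    return -(x - 0.55) ** 2


# Discretize [0, 1] into 8 cells and use their midpoints as candidates.
cells = [(k + 0.5) / 8 for k in range(8)]
best = tournament(cells, make_oracle(f), n=1000)
print(best)
```

The fixed batch size is the main simplification here: the abstract stresses that choosing the stopping time n for each pair is itself part of the decision problem.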

Related papers

Learning-Augmented Algorithms for Boolean Satisfiability [7.642039348547126]
We study the classic Boolean satisfiability (SAT) decision and optimization problems using two forms of advice. "Subset advice" provides a random $\epsilon$ fraction of the variables from an optimal assignment, whereas "label advice" provides noisy predictions for all variables in an optimal assignment. For the optimization problem, we show how to incorporate subset advice in a black-box fashion with any $\alpha$-approximation algorithm.
arXiv Detail & Related papers (2025-05-09T15:54:34Z)
Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We consider the problem of learning in adversarial Markov decision processes with an oblivious adversary. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$.
arXiv Detail & Related papers (2024-07-08T08:06:45Z)
Stopping Bayesian Optimization with Probabilistic Regret Bounds [1.4141453107129403]
We investigate replacing a de facto stopping rule with criteria based on the probability that a point satisfies a given set of conditions. We give a practical algorithm for evaluating Monte Carlo stopping rules in a manner that is both sample efficient and robust to estimation error.
arXiv Detail & Related papers (2024-02-26T18:34:58Z)
Towards Efficient and Optimal Covariance-Adaptive Algorithms for Combinatorial Semi-Bandits [12.674929126684528]
We address the problem of semi-bandits, where a player selects among P actions from the power set of a set containing d base items. We show that our approach efficiently leverages the semi-bandit feedback and outperforms bandit feedback approaches.
arXiv Detail & Related papers (2024-02-23T08:07:54Z)
Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed. SGB adopts an optimized explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$. Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm suffers only $\tilde O(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting. Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function. We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z)
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other. We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z)
Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information [78.78486761923855]
In many real world problems, we want to infer some property of an expensive black-box function f, given a budget of T function evaluations. We present a procedure, InfoBAX, that sequentially chooses queries that maximize mutual information with respect to the algorithm's output. On these problems, InfoBAX uses up to 500 times fewer queries to f than required by the original algorithm.
arXiv Detail & Related papers (2021-04-19T17:22:11Z)
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
Empirical Risk Minimization in the Non-interactive Local Model of Differential Privacy [26.69391745812235]
We study the Empirical Risk Minimization (ERM) problem in the non-interactive Local Differential Privacy (LDP) model. Previous research indicates that the sample complexity needed to achieve error $\alpha$ must depend on the dimensionality $p$ for general loss functions.
arXiv Detail & Related papers (2020-11-11T17:48:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.