Related papers: Vector preference-based contextual bandits under distributional shifts

Vector preference-based contextual bandits under distributional shifts

URL: http://arxiv.org/abs/2508.15966v1
Date: Thu, 21 Aug 2025 21:19:16 GMT
Title: Vector preference-based contextual bandits under distributional shifts
Authors: Apurv Shukla, P. R. Kumar,
Abstract summary: We consider contextual bandit learning under distribution shift.<n>We propose an adaptive-discretization and optimistic elimination based policy that self-tunes to the underlying distribution shift.
Score: 8.19666118455293
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We consider contextual bandit learning under distribution shift when reward vectors are ordered according to a given preference cone. We propose an adaptive-discretization and optimistic elimination based policy that self-tunes to the underlying distribution shift. To measure the performance of this policy, we introduce the notion of preference-based regret which measures the performance of a policy in terms of distance between Pareto fronts. We study the performance of this policy by establishing upper bounds on its regret under various assumptions on the nature of distribution shift. Our regret bounds generalize known results for the existing case of no distribution shift and vectorial reward settings, and scale gracefully with problem parameters in presence of distribution shifts.

Related papers

Proxy Methods for Domain Adaptation [78.03254010884783]
proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings.
arXiv Detail & Related papers (2024-03-12T09:32:41Z)
Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set. Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity. For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies. Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We study the piecewise stationary semi-bandit problem with causally related rewards. In our nonstationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process. The problem becomes aggravated in the semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms.
arXiv Detail & Related papers (2023-07-26T12:06:13Z)
Learning under random distributional shifts [0.0]
We consider a class of random distribution shift models that capture arbitrary changes in the underlying covariate space. We show that the hybrid approach is robust to the strength of the distribution shift and the proxy relationship. In two high-impact domains, we find that the proposed approach results in substantially lower mean-squared error.
arXiv Detail & Related papers (2023-06-05T15:14:34Z)
Distributional Multi-Objective Decision Making [2.185694185279913]
We take a distributional approach and introduce a novel dominance criterion relating return distributions of policies directly. We propose a novel algorithm to learn the distributional undominated set and further contribute pruning operators to reduce the set to the convex distributional undominated set.
arXiv Detail & Related papers (2023-05-09T15:47:56Z)
Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper tries to learn the diverse policies from the history of state-action pairs under a non-Markovian environment. We first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret [5.1398743023989555]
We study the nonstationary Multi-Armed Bandit (MAB) problem in which the distribution of rewards associated with each arm are assumed to be time-varying. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget.
arXiv Detail & Related papers (2021-01-22T07:34:09Z)
Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate. We show that the risk improvement over PI grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
arXiv Detail & Related papers (2021-01-05T20:07:56Z)
Implicit Distributional Reinforcement Learning [61.166030238490634]
implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) Semi-implicit actor (SIA) powered by a flexible policy distribution. We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
Adaptive Discretization against an Adversary: Lipschitz bandits, Dynamic Pricing, and Auction Tuning [56.23358327635815]
Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces.<n>A central theme here is the adaptive discretization of the action space, which gradually zooms in'' on the more promising regions.<n>We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds.
arXiv Detail & Related papers (2020-06-22T16:06:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.