Reward Imputation with Sketching for Contextual Batched Bandits
- URL: http://arxiv.org/abs/2210.06719v3
- Date: Sat, 7 Oct 2023 08:49:31 GMT
- Title: Reward Imputation with Sketching for Contextual Batched Bandits
- Authors: Xiao Zhang, Ninglu Shao, Zihua Si, Jun Xu, Wenhan Wang, Hanjing Su,
Ji-Rong Wen
- Abstract summary: Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
- Score: 48.80803376405073
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contextual batched bandit (CBB) is a setting where a batch of rewards is
observed from the environment at the end of each episode, but the rewards of
the non-executed actions are unobserved, resulting in partial-information
feedback. Existing approaches for CBB often ignore the rewards of the
non-executed actions, leading to underutilization of feedback information. In
this paper, we propose an efficient approach called Sketched Policy Updating
with Imputed Rewards (SPUIR) that completes the unobserved rewards using
sketching, which approximates the full-information feedback. We formulate
reward imputation as an imputation regularized ridge regression problem that
captures the feedback mechanisms of both executed and non-executed actions. To
reduce time complexity, we solve the regression problem using randomized
sketching. We prove that our approach achieves an instantaneous regret with
controllable bias and smaller variance than approaches without reward
imputation. Furthermore, our approach enjoys a sublinear regret bound against
the optimal policy. We also present two extensions, a rate-scheduled version
and a version for nonlinear rewards, making our approach more practical.
Experimental results show that SPUIR outperforms state-of-the-art baselines on
synthetic, public benchmark, and real-world datasets.
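As a rough illustration of the sketched ridge-regression step described above, the following is a minimal sketch assuming a plain Gaussian sketch and a single linear reward model; it is not the authors' SPUIR implementation, and the names (sketched_ridge_imputation, X_unobserved) are hypothetical.

```python
import numpy as np

def sketched_ridge_imputation(X, r, sketch_size, lam=1.0, seed=0):
    """Solve a ridge regression on a randomly sketched batch.

    X : (n, d) context features of the executed actions in a batch
    r : (n,)   observed rewards for those executed actions
    Returns a weight vector w used to impute rewards for non-executed actions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Gaussian sketch: compress the n batch rows down to sketch_size rows,
    # so the normal equations are built from a much smaller system.
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    Xs, rs = S @ X, S @ r
    # Ridge solution on the sketched problem: (Xs^T Xs + lam * I) w = Xs^T rs
    w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ rs)
    return w

# Hypothetical usage: rewards of non-executed actions with contexts
# X_unobserved are imputed as X_unobserved @ w.
```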
Related papers
- Contextual Linear Optimization with Bandit Feedback [35.692428244561626]
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients.
We study a class of offline learning algorithms for CLO with bandit feedback.
We show a fast-rate regret bound for induced empirical risk minimization (IERM) that allows for misspecified model classes and flexible choices of the optimization estimate.
arXiv Detail & Related papers (2024-05-26T13:27:27Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z)
- Break your Bandit Routine with LSD Rewards: a Last Switch Dependent Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We provide an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Scalable Bayesian Inverse Reinforcement Learning [93.27920030279586]
We introduce Approximate Variational Reward Imitation Learning (AVRIL).
Our method addresses the ill-posed nature of the inverse reinforcement learning problem.
Applying our method to real medical data alongside classic control simulations, we demonstrate Bayesian reward inference in environments beyond the scope of current methods.
arXiv Detail & Related papers (2021-02-12T12:32:02Z)
- Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback [32.62857394584907]
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the adversarial and non-adversarial cases.
arXiv Detail & Related papers (2020-12-13T12:25:41Z)
- Reinforcement Learning with Trajectory Feedback [76.94405309609552]
In this work, we take a first step towards relaxing the standard per-step reward assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret.
arXiv Detail & Related papers (2020-08-13T17:49:18Z)
- Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions [18.90946044396516]
Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner.
Providing and evaluating good sequences of recommendations is therefore a central problem for these services.
We propose a new counterfactual estimator that allows for sequential interactions in the rewards, with lower variance in an asymptotically unbiased manner.
arXiv Detail & Related papers (2020-07-25T17:58:01Z)