To update or not to update? Delayed Nonparametric Bandits with Randomized Allocation
- URL: http://arxiv.org/abs/2005.13078v1
- Date: Tue, 26 May 2020 23:06:20 GMT
- Title: To update or not to update? Delayed Nonparametric Bandits with Randomized Allocation
- Authors: Sakshi Arya and Yuhong Yang
- Abstract summary: The delayed-rewards problem in contextual bandits has been of interest in various practical settings. We study randomized allocation strategies and provide an understanding of how the exploration-exploitation tradeoff is affected by delays in observing the rewards.
- Score: 5.9814720629540155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The delayed-rewards problem in contextual bandits has been of interest in various practical settings. We study randomized allocation strategies and provide an understanding of how the exploration-exploitation tradeoff is affected by delays in observing the rewards. In randomized strategies, the extent of exploration versus exploitation is controlled by a user-determined exploration probability sequence. In the presence of delayed rewards, one may choose between using the original exploration sequence, which updates at every time point, and updating the sequence only when a new reward is observed, leading to two competing strategies. In this work, we show that while both strategies may lead to strong consistency in allocation, the property holds for a wider scope of situations under the latter. However, for finite-sample performance, we illustrate that both strategies have their own advantages and disadvantages, depending on the severity of the delay and the underlying reward-generating mechanisms.
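To make the two competing update rules concrete, here is a minimal epsilon-greedy-style sketch of randomized allocation under a fixed reward delay. It is a hypothetical simplification: the two-armed Bernoulli setup, the exploration schedule eps(k) = 1/sqrt(k), the fixed-delay model, and the function names are assumptions for illustration, not the paper's nonparametric estimators or experiments. One variant advances the exploration-probability index at every time point; the other advances it only when a delayed reward is actually observed.

```python
import random

def eps(k):
    """A user-chosen exploration probability sequence, e.g. eps_k = 1 / sqrt(k)."""
    return min(1.0, 1.0 / (k ** 0.5))

def run(T=5000, delay=50, update_on_reward_only=False, seed=0):
    """Two-armed Bernoulli bandit with a fixed reward delay.

    update_on_reward_only=False : the exploration index advances every round.
    update_on_reward_only=True  : it advances only when a reward is observed.
    """
    rng = random.Random(seed)
    means = [0.4, 0.6]                      # unknown arm means (arm 1 is best)
    counts, sums = [0, 0], [0.0, 0.0]       # sufficient statistics per arm
    pending = []                            # (arrival_time, arm, reward)
    k = 1                                   # index into the exploration sequence
    pulls_of_best = 0

    for t in range(1, T + 1):
        # Rewards whose delay has elapsed become visible now.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, arm, r in arrived:
            counts[arm] += 1
            sums[arm] += r

        # Randomized allocation: explore with probability eps(k), else exploit.
        if rng.random() < eps(k) or 0 in counts:
            arm = rng.randrange(2)
        else:
            arm = max(range(2), key=lambda a: sums[a] / counts[a])
        pulls_of_best += (arm == 1)

        # The reward is generated now but only observed `delay` rounds later.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pending.append((t + delay, arm, reward))

        # The two competing update rules for the exploration sequence.
        if update_on_reward_only:
            k += len(arrived)   # advance only when new rewards are observed
        else:
            k += 1              # advance at every time point

    return pulls_of_best / T

if __name__ == "__main__":
    print("update every round :", run(update_on_reward_only=False))
    print("update on rewards  :", run(update_on_reward_only=True))
```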
Related papers
- Likelihood Reward Redistribution [0.0]
We propose a Likelihood Reward Redistribution (LRR) framework for reward redistribution.
When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals.
arXiv Detail & Related papers (2025-03-20T20:50:49Z)
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards.
We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration.
We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits.
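A rough sketch of the idea in the multi-armed bandit special case the summary mentions (hypothetical code, not the MaxInfoRL algorithm: the Gaussian arm model, the info_gain bonus, and the trade-off weight lam are assumptions): arms are scored by an extrinsic empirical-mean term plus an intrinsic information-gain term.

```python
import math, random

def info_gain(n, prior_var=1.0, noise_var=1.0):
    """Expected information gain (in nats) about an arm's mean from one more pull,
    under a Gaussian model: 0.5 * log(var_n / var_{n+1})."""
    var = lambda m: 1.0 / (1.0 / prior_var + m / noise_var)
    return 0.5 * math.log(var(n) / var(n + 1))

def maxinfo_bandit(means=(0.2, 0.5, 0.8), T=2000, lam=1.0, seed=0):
    rng = random.Random(seed)
    n = [0] * len(means)
    s = [0.0] * len(means)
    total = 0.0
    for t in range(T):
        # Extrinsic term: empirical mean; intrinsic term: info gain of one pull.
        score = [(s[a] / n[a] if n[a] else 0.0) + lam * info_gain(n[a])
                 for a in range(len(means))]
        a = max(range(len(means)), key=lambda i: score[i])
        r = rng.gauss(means[a], 1.0)
        n[a] += 1; s[a] += r; total += r
    return total / T

if __name__ == "__main__":
    print("average reward:", maxinfo_bandit())
```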
arXiv Detail & Related papers (2024-12-16T18:59:53Z)
- Reward Augmentation in Reinforcement Learning for Testing Distributed Systems [6.0560257343687995]
Bugs in popular distributed protocol implementations have been the source of many downtimes in popular internet services.
We describe a randomized testing approach for distributed protocol implementations based on reinforcement learning.
We show two different techniques that build on one another.
arXiv Detail & Related papers (2024-09-02T15:07:05Z)
- Beyond Optimism: Exploration With Partially Observable Rewards [10.571972176725371]
Exploration in reinforcement learning (RL) remains an open challenge.
We present a novel strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy.
arXiv Detail & Related papers (2024-06-20T00:42:02Z)
- Randomized Confidence Bounds for Stochastic Partial Monitoring [8.649322557020666]
The partial monitoring (PM) framework provides a theoretical formulation of sequential learning problems with incomplete feedback.
In contextual PM, the outcomes depend on some side information that is observable by the agent before selecting the action on each round.
We introduce a new class of PM strategies based on the randomization of deterministic confidence bounds.
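Partial monitoring generalizes bandits, so a bandit special case gives a rough feel for randomizing a deterministic confidence bound (a hypothetical sketch; the random scaling factor and the UCB-style width below are assumptions, not the paper's PM strategies).

```python
import math, random

def randomized_ucb(means=(0.3, 0.5, 0.7), T=3000, seed=0):
    """Bandit special case: replace the deterministic confidence width with a
    randomly rescaled one on each round."""
    rng = random.Random(seed)
    K = len(means)
    n, s = [0] * K, [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:                          # pull each arm once to initialize
            a = t - 1
        else:
            z = abs(rng.gauss(0.0, 1.0))    # random confidence-scaling factor
            width = lambda i: z * math.sqrt(2.0 * math.log(t) / n[i])
            a = max(range(K), key=lambda i: s[i] / n[i] + width(i))
        r = 1.0 if rng.random() < means[a] else 0.0
        n[a] += 1; s[a] += r; total += r
    return total / T

if __name__ == "__main__":
    print("average reward:", randomized_ucb())
```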
arXiv Detail & Related papers (2024-02-07T16:18:59Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
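A minimal sketch of temporally smoothing a sparse reward sequence into a denser learning target (assumptions: a Gaussian kernel with hand-picked sigma and radius; the paper's actual smoothing choices may differ).

```python
import math

def smooth_rewards(rewards, sigma=2.0, radius=6):
    """Gaussian-smoothed reward targets: each timestep's target is a weighted
    average of nearby true rewards, spreading out sparse reward spikes."""
    kernel = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    norm = sum(kernel)
    kernel = [w / norm for w in kernel]
    T = len(rewards)
    out = []
    for t in range(T):
        val = 0.0
        for k, w in enumerate(kernel):
            j = t + k - radius
            if 0 <= j < T:
                val += w * rewards[j]
        out.append(val)
    return out

# A sparse episodic reward becomes a dense, smoother learning target.
sparse = [0.0] * 20 + [1.0] + [0.0] * 20
print(smooth_rewards(sparse)[15:27])
```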
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
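A toy environment sketch of this setting (hypothetical: the even partition of each pull's reward over a fixed span and the Beta-distributed totals are assumptions; this is not the TP-UCB-FR or TP-UCB-EW algorithm).

```python
import random

class TPMABEnv:
    """Each pull's total reward is partitioned over the next `span` rounds,
    so the learner receives partial feedback before the full reward is known."""
    def __init__(self, arm_means=(0.4, 0.7), span=5, seed=0):
        self.rng = random.Random(seed)
        self.arm_means = arm_means
        self.span = span
        self.pending = []          # [remaining_rounds, per_round_piece, arm]
        self.t = 0

    def pull(self, arm):
        # Total reward has mean arm_means[arm]; it is spread evenly over `span` rounds.
        total = self.rng.betavariate(2, 2) * 2 * self.arm_means[arm]
        self.pending.append([self.span, total / self.span, arm])

    def step(self):
        """Advance one round and return the partial rewards observed this round."""
        self.t += 1
        observed = [(arm, piece) for rem, piece, arm in self.pending]
        for item in self.pending:
            item[0] -= 1
        self.pending = [p for p in self.pending if p[0] > 0]
        return observed

if __name__ == "__main__":
    env = TPMABEnv()
    env.pull(1)
    for _ in range(3):
        print(env.step())
```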
arXiv Detail & Related papers (2022-06-01T15:56:59Z)
- SEREN: Knowing When to Explore and When to Exploit [14.188362393915432]
We introduce the Selective Reinforcement Exploration Network (SEREN), which poses the exploration-exploitation trade-off as a game.
Using a form of policies known as impulse control, the switcher is able to determine the best set of states at which to switch to the exploration policy.
We prove that SEREN converges quickly and induces a natural schedule towards pure exploitation.
arXiv Detail & Related papers (2022-05-30T12:44:56Z)
- Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments.
We present simulations for 2-arm experiments that explore two algorithms combining the benefits of uniform random assignment for statistical analysis with those of reward maximization.
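One simple way to combine the two ingredients named in the title is a mixture policy (a hypothetical sketch; the mixing probability p_uniform and the Beta-Bernoulli model are assumptions, not necessarily the algorithms simulated in the paper).

```python
import random

def ts_with_uniform(p_uniform=0.2, T=1000, means=(0.45, 0.55), seed=0):
    """2-arm Bernoulli experiment: with probability p_uniform assign uniformly at
    random (useful for unbiased statistical analysis), otherwise Thompson sample."""
    rng = random.Random(seed)
    alpha, beta = [1, 1], [1, 1]            # Beta(1,1) priors per arm
    uniform_log = []                         # (arm, reward) pairs from uniform rounds
    for t in range(T):
        if rng.random() < p_uniform:
            arm = rng.randrange(2)
            from_uniform = True
        else:
            samples = [rng.betavariate(alpha[a], beta[a]) for a in range(2)]
            arm = max(range(2), key=lambda a: samples[a])
            from_uniform = False
        r = 1 if rng.random() < means[arm] else 0
        alpha[arm] += r; beta[arm] += 1 - r
        if from_uniform:
            uniform_log.append((arm, r))     # kept aside for hypothesis testing
    return uniform_log

if __name__ == "__main__":
    print(len(ts_with_uniform()), "uniformly assigned rounds logged")
```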
arXiv Detail & Related papers (2021-12-15T22:11:58Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
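A minimal sketch of a randomized return-decomposition loss (assumptions: the subset size, the uniform subsampling, and the toy linear proxy reward; the paper's training details may differ): per-step proxy rewards on a random subset of timesteps are rescaled and regressed onto the observed episodic return.

```python
import random

def rrd_loss(proxy_reward, trajectory, episode_return, subset_size=8, rng=None):
    """Monte-Carlo surrogate for return decomposition: predict per-step proxy
    rewards on a random subset of timesteps, rescale their sum to the full
    trajectory length, and regress it onto the observed episodic return."""
    rng = rng or random.Random(0)
    T = len(trajectory)
    idx = rng.sample(range(T), min(subset_size, T))
    est = (T / len(idx)) * sum(proxy_reward(trajectory[i]) for i in idx)
    return (episode_return - est) ** 2

# Toy usage: a linear proxy reward over 1-d "states".
traj = [float(i) for i in range(50)]
print(rrd_loss(lambda s: 0.01 * s, traj, episode_return=12.0))
```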
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- Disturbing Reinforcement Learning Agents with Corrupted Rewards [62.997667081978825]
We analyze the effects of different attack strategies based on reward perturbations on reinforcement learning algorithms.
We show that smoothly crafted adversarial rewards are able to mislead the learner, and that using low exploration probability values makes the learned policy more robust to corrupted rewards.
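As a toy picture of a smooth reward perturbation (hypothetical: the sinusoidal form, the magnitude eps, and the period are assumptions, not the attack strategies analyzed in the paper), the learner is trained on the corrupted stream rather than the true one.

```python
import math

def smooth_corruption(reward, t, eps=0.3, period=50):
    """Add a small, smoothly varying perturbation to the true reward signal."""
    return reward + eps * math.sin(2 * math.pi * t / period)

# Example: corrupt a sparse reward stream before the learner observes it.
true_rewards = [1.0 if t % 25 == 0 else 0.0 for t in range(100)]
corrupted = [smooth_corruption(r, t) for t, r in enumerate(true_rewards)]
print(corrupted[:5])
```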
arXiv Detail & Related papers (2021-02-12T15:53:48Z)
- Learning from History for Byzantine Robust Optimization [52.68913869776858]
Byzantine robustness has received significant attention recently given its importance for distributed learning.
We show that most existing robust aggregation rules may not converge even in the absence of any Byzantine attackers.
arXiv Detail & Related papers (2020-12-18T16:22:32Z)
- Efficient exploration of zero-sum stochastic games [83.28949556413717]
We investigate the increasingly important and common game-solving setting where we do not have an explicit description of the game but only oracle access to it through gameplay.
During a limited-duration learning phase, the algorithm can control the actions of both players in order to try to learn the game and how to play it well.
Our motivation is to quickly learn strategies that have low exploitability in situations where evaluating the payoffs of a queried strategy profile is costly.
arXiv Detail & Related papers (2020-02-24T20:30:38Z)