Non-Stationary Delayed Bandits with Intermediate Observations
- URL: http://arxiv.org/abs/2006.02119v2
- Date: Tue, 11 Aug 2020 16:09:01 GMT
- Title: Non-Stationary Delayed Bandits with Intermediate Observations
- Authors: Claire Vernade, Andras Gyorgy, and Timothy Mann
- Abstract summary: Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics.
We introduce the problem of non-stationary, delayed bandits with intermediate observations.
We develop an efficient algorithm based on UCRL, and prove sublinear regret guarantees for its performance.
- Score: 10.538264213183076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online recommender systems often face long delays in receiving feedback,
especially when optimizing for some long-term metrics. While mitigating the
effects of delays in learning is well-understood in stationary environments,
the problem becomes much more challenging when the environment changes. In
fact, if the timescale of the change is comparable to the delay, it is
impossible to learn about the environment, since the available observations are
already obsolete. However, the arising issues can be addressed if intermediate
signals are available without delay, such that given those signals, the
long-term behavior of the system is stationary. To model this situation, we
introduce the problem of stochastic, non-stationary, delayed bandits with
intermediate observations. We develop a computationally efficient algorithm
based on UCRL, and prove sublinear regret guarantees for its performance.
Experimental results demonstrate that our method is able to learn in
non-stationary delayed environments where existing methods fail.
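To make the setting concrete, here is a minimal, hypothetical Python sketch of the model: pulling an arm yields an immediate intermediate observation whose distribution drifts over time, while the long-term reward depends on that observation through a stationary mapping and arrives only after a fixed delay. The sliding-window-plus-UCB learner below illustrates why undelayed intermediate signals make the problem learnable; it is not the paper's UCRL-based algorithm, and all names and constants are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S = 3, 4            # arms and intermediate-observation (state) values
T, DELAY = 5000, 200   # horizon and feedback delay
WINDOW = 500           # sliding window for the non-stationary part

counts_as = np.ones((K, S))                # recent arm -> state counts
recent = []                                # (arm, state) pairs inside the window
r_sum, r_cnt = np.zeros(S), np.zeros(S)    # delayed reward statistics per state
pending = []                               # (arrival_time, state, reward)

def state_dist(a, t):
    # non-stationary environment: the arm -> state distribution drifts with t
    p = np.exp(np.sin(np.arange(S) + a + t / 1000.0))
    return p / p.sum()

true_r = np.linspace(0.1, 0.9, S)          # stationary state -> reward mapping

total = 0.0
for t in range(T):
    # optimistic value per arm: windowed P_hat(state|arm) times a UCB on r(state)
    p_hat = counts_as / counts_as.sum(axis=1, keepdims=True)
    bonus = np.sqrt(2 * np.log(t + 2) / np.maximum(r_cnt, 1))
    r_ucb = np.minimum(r_sum / np.maximum(r_cnt, 1) + bonus, 1.0)
    a = int(np.argmax(p_hat @ r_ucb))

    s = rng.choice(S, p=state_dist(a, t))      # intermediate observation: immediate
    reward = float(rng.random() < true_r[s])   # long-term reward: delayed
    pending.append((t + DELAY, s, reward))
    total += reward

    counts_as[a, s] += 1
    recent.append((a, s))
    if len(recent) > WINDOW:                   # forget stale samples (non-stationarity)
        a_old, s_old = recent.pop(0)
        counts_as[a_old, s_old] -= 1

    while pending and pending[0][0] <= t:      # absorb delayed reward feedback
        _, s_obs, r_obs = pending.pop(0)
        r_sum[s_obs] += r_obs
        r_cnt[s_obs] += 1

print(f"average reward: {total / T:.3f}")
```

The sketch mirrors the paper's key structural assumption: the undelayed observations carry all the non-stationarity, so the delayed rewards only ever have to estimate a stationary quantity.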
Related papers
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
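As a rough illustration of the scheme studied there, the sketch below (with invented dynamics and constants) runs stochastic approximation in which every update uses a gradient evaluated at a stale iterate; for simplicity the noise is i.i.d. rather than Markovian.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def grad(x):
    # noisy gradient of f(x) = 0.5 x^T A x - b^T x
    return A @ x - b + 0.01 * rng.normal(size=2)

x, history, tau_max = np.zeros(2), [], 5
for t in range(2000):
    history.append(x.copy())                 # iterates workers may have read
    tau = int(rng.integers(0, tau_max + 1))  # random update delay
    stale = history[max(0, t - tau)]         # gradient computed at a stale iterate
    x = x - 0.05 * grad(stale)

print(x, np.linalg.solve(A, b))              # iterate vs. the true solution
```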
- Neural Laplace Control for Continuous-time Delayed Systems [76.81202657759222]
We propose a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner.
We show experimentally that, on continuous-time delayed environments, it achieves near-expert policy performance.
arXiv Detail & Related papers (2023-02-24T12:40:28Z)
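The planner half of such a method can be illustrated with a generic random-shooting MPC loop. The dynamics function below is a hand-written stand-in for a learned (e.g. Neural Laplace) model, delays are not modeled, and all names and constants are invented; it sketches the control loop only, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(2)

def dynamics(x, u, dt=0.1):
    # stand-in for a learned continuous-time dynamics model, discretized
    return x + dt * np.array([x[1], u - 0.1 * x[1]])

def cost(x, u):
    return x[0] ** 2 + 0.1 * x[1] ** 2 + 0.01 * u ** 2

def mpc_action(x0, horizon=15, samples=256):
    best_u, best_c = 0.0, np.inf
    for _ in range(samples):                 # sample candidate action sequences
        us = rng.uniform(-1.0, 1.0, size=horizon)
        x, c = x0.copy(), 0.0
        for u in us:                         # roll the model forward
            c += cost(x, u)
            x = dynamics(x, u)
        if c < best_c:
            best_c, best_u = c, us[0]
    return best_u                            # execute only the first action

x = np.array([1.0, 0.0])
for _ in range(50):
    x = dynamics(x, mpc_action(x))
print(x)                                     # final state after 50 MPC steps
```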
- DaDe: Delay-adaptive Detector for Streaming Perception [0.0]
In a real-time setting, the surrounding environment has already changed by the time processing finishes.
Streaming perception was proposed to jointly assess the latency and accuracy of real-time video perception.
We develop a model that reflects processing delays in real time and produces the most reasonable results.
arXiv Detail & Related papers (2022-12-22T09:25:46Z)
- Intrinsic Anomaly Detection for Multi-Variate Time Series [33.199682596741276]
Intrinsic anomalies are changes in the functional dependency structure between time series representing an environment and time series representing the internal state of a system placed in that environment.
This notion addresses a shortcoming of existing anomaly detection methods, which cannot differentiate between expected changes in the system's state and unexpected ones, i.e., changes in the system that deviate from the environment's influence.
Our most promising approach is fully unsupervised and combines adversarial learning with time series representation learning, thereby addressing problems such as label sparsity and subjectivity.
arXiv Detail & Related papers (2022-06-29T00:51:44Z)
- Delay-adaptive step-sizes for asynchronous learning [8.272788656521415]
We show that it is possible to use learning rates that depend on the actual, time-varying delays in the system.
For each of the methods considered, we demonstrate how delays can be measured on-line, present delay-adaptive step-size policies, and illustrate their theoretical and practical advantages over the state of the art.
arXiv Detail & Related papers (2022-02-17T09:51:22Z)
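A minimal sketch of the idea in the entry above, with invented constants: each asynchronous update measures the staleness of the gradient it applies and scales the step size down accordingly.

```python
import numpy as np

rng = np.random.default_rng(3)
target = np.array([1.0, -2.0])               # minimizer of f(x) = ||x - target||^2
x, eta0, snapshots = np.zeros(2), 0.1, []

for t in range(1000):
    snapshots.append(x.copy())               # iterate versions workers read
    tau = int(rng.integers(0, 10))           # delay measured on-line for this update
    stale = snapshots[max(0, t - tau)]
    g = 2.0 * (stale - target)               # gradient at the stale iterate
    x = x - (eta0 / (1.0 + tau)) * g         # step size adapts to the actual delay

print(x)                                     # approaches (1, -2)
```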
- Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions [54.25616645675032]
We study the Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm.
We consider two settings: the reward-dependent delay setting, where realized delays may depend on the rewards, and the reward-independent delay setting.
Our main contribution is algorithms that achieve near-optimal regret in each of the settings.
arXiv Detail & Related papers (2021-06-04T12:26:06Z)
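For intuition, here is a simple sketch of UCB under reward-independent random delays: feedback is queued and only folded into the statistics once it arrives. The delay distribution and constants are illustrative, and this is plain delayed UCB, not the near-optimal algorithms of the paper.

```python
import heapq
import numpy as np

rng = np.random.default_rng(4)
means = [0.3, 0.5, 0.7]                      # Bernoulli arm means (unknown)
K, T = len(means), 10000
n, s = np.zeros(K), np.zeros(K)              # observed pulls and reward sums
pending = []                                 # min-heap of (arrival_time, arm, reward)

for t in range(T):
    while pending and pending[0][0] <= t:    # absorb feedback that has arrived
        _, a, r = heapq.heappop(pending)
        n[a] += 1
        s[a] += r
    ucb = s / np.maximum(n, 1) + np.sqrt(2 * np.log(t + 1) / np.maximum(n, 1))
    a = int(np.argmax(ucb))                  # play the optimistic arm
    r = float(rng.random() < means[a])
    delay = int(rng.geometric(0.01))         # reward-independent random delay
    heapq.heappush(pending, (t + delay, a, r))

print(n, s / np.maximum(n, 1))               # pull counts and empirical means
```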
- Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
In many RL applications, once training ends, it is vital to detect any deterioration in the agent's performance as soon as possible.
We consider an episodic framework where the rewards within each episode are neither independent, nor identically distributed, nor Markovian.
We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
arXiv Detail & Related papers (2020-10-22T12:45:55Z)
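As a simplified stand-in for the test described above, the sketch below flags deterioration with a one-sided z-test comparing recent episode rewards to a reference window. Unlike the paper's test it assumes independent rewards, and the threshold and data are illustrative.

```python
import numpy as np

def deteriorated(reference, recent, z_crit=2.33):
    """One-sided z-test for a downward mean-shift (z_crit ~ alpha = 0.01)."""
    ref, rec = np.asarray(reference), np.asarray(recent)
    se = np.sqrt(ref.var(ddof=1) / len(ref) + rec.var(ddof=1) / len(rec))
    return (ref.mean() - rec.mean()) / se > z_crit

rng = np.random.default_rng(5)
healthy = rng.normal(1.0, 0.2, size=200)    # rewards from validated behaviour
still_ok = rng.normal(1.0, 0.2, size=50)    # no shift: should not trigger
degraded = rng.normal(0.8, 0.2, size=50)    # downward mean-shift: should trigger
print(deteriorated(healthy, still_ok), deteriorated(healthy, degraded))
```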
- Reinforcement Learning with Random Delays [14.707955337702943]
We show that partially resampling trajectory fragments in hindsight allows for off-policy multi-step value estimation.
We apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an algorithm based on Soft Actor-Critic with significantly better performance in environments with delays.
arXiv Detail & Related papers (2020-10-06T18:39:23Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal data distribution for corrective feedback and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
- Time-varying Gaussian Process Bandit Optimization with Non-constant Evaluation Time [93.6788993843846]
We propose a novel time-varying Bayesian optimization algorithm that can effectively handle the non-constant evaluation time.
Our regret bound shows that the pattern of the evaluation-time sequence can greatly affect the difficulty of the problem.
arXiv Detail & Related papers (2020-03-10T13:28:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.