Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related
Rewards
- URL: http://arxiv.org/abs/2307.09093v1
- Date: Tue, 18 Jul 2023 09:22:33 GMT
- Title: Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related
Rewards
- Authors: Saeed Ghoorchian and Setareh Maghsudi
- Abstract summary: We formalize a non-stationary and delayed combinatorial semi-bandit problem with causally related rewards.
We develop a policy that learns the structural dependencies from delayed feedback and utilizes them to optimize decision-making.
We evaluate our method via numerical analysis using synthetic and real-world datasets to detect the regions that contribute the most to the spread of COVID-19 in Italy.
- Score: 7.0997346625024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential decision-making under uncertainty is often associated with long
feedback delays. Such delays degrade the performance of the learning agent in
identifying a subset of arms with the optimal collective reward in the long
run. This problem becomes significantly more challenging in a non-stationary
environment with structural dependencies amongst the reward distributions
associated with the arms. Therefore, besides adapting to delays and
environmental changes, learning the causal relations alleviates the adverse
effects of feedback delay on the decision-making process. We formalize the
described setting as a non-stationary and delayed combinatorial semi-bandit
problem with causally related rewards. We model the causal relations by a
directed graph in a stationary structural equation model. The agent maximizes
the long-term average payoff, defined as a linear function of the base arms'
rewards. We develop a policy that learns the structural dependencies from
delayed feedback and utilizes them to optimize decision-making while
adapting to drifts. We prove a regret bound for the performance of the proposed
algorithm. In addition, we evaluate our method via numerical analysis using
synthetic and real-world datasets to detect the regions that contribute the
most to the spread of COVID-19 in Italy.
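The reward model described above lends itself to a compact simulation. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: a stationary linear structural equation model over a directed acyclic graph, where each base arm's observed reward is an exogenous reward plus a weighted sum of its causal parents' observed rewards, and the agent's payoff is the sum of the selected arms' rewards. The adjacency matrix, arm means, and bundle size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5

# A[i, j] != 0 means base arm j causally influences base arm i.
# A strictly lower-triangular A encodes a directed acyclic graph,
# so (I - A) is invertible.  (Hypothetical weights.)
A = np.tril(rng.uniform(0.1, 0.5, size=(n_arms, n_arms)), k=-1)

def observed_rewards(mu):
    """Draw exogenous rewards around mu and propagate them through the
    linear SEM: r = A @ r + f, hence r = (I - A)^{-1} @ f."""
    f = rng.normal(loc=mu, scale=0.1)
    return np.linalg.solve(np.eye(n_arms) - A, f)

mu = rng.uniform(0.2, 0.8, size=n_arms)   # hypothetical base-arm means

# The agent selects a bundle (super arm) of k base arms; its payoff is a
# linear function of the selected arms' observed rewards (here, the sum).
k = 2
r = observed_rewards(mu)
bundle = np.argsort(r)[-k:]               # oracle selection, for illustration
payoff = r[bundle].sum()
print("bundle:", bundle, "payoff:", payoff)
```

Because the graph is acyclic, (I - A) is invertible and the observed rewards are uniquely determined by the exogenous draws; a learning policy would have to estimate the causal structure and the means from delayed, drifting feedback rather than use the oracle selection shown here.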
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
As a quality index, we propose a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We study the piecewise stationary semi-bandit problem with causally related rewards.
In our non-stationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process.
The problem is aggravated in the semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms.
arXiv Detail & Related papers (2023-07-26T12:06:13Z)
- Linear Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We propose a policy that determines the causal relations by learning the network's topology.
We establish a sublinear regret bound for the proposed algorithm.
arXiv Detail & Related papers (2022-12-25T16:05:21Z)
- Gated Recurrent Neural Networks with Weighted Time-Delay Feedback [59.125047512495456]
We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism.
We show that $\tau$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures.
arXiv Detail & Related papers (2022-12-01T02:26:34Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Break your Bandit Routine with LSD Rewards: a Last Switch Dependent Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We propose an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z)
- Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions [54.25616645675032]
We study the Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm.
We consider two settings: the reward-dependent delay setting, where realized delays may depend on the rewards, and the reward-independent delay setting.
Our main contribution is algorithms that achieve near-optimal regret in each of the settings (a generic delayed-feedback decision loop is sketched after this list).
arXiv Detail & Related papers (2021-06-04T12:26:06Z)
- Stochastic bandits with arm-dependent delays [102.63128271054741]
We propose a simple but efficient UCB-based algorithm called PatientBandits.
We provide both problem-dependent and problem-independent bounds on the regret as well as performance lower bounds.
arXiv Detail & Related papers (2020-06-18T12:13:58Z)
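A mechanism shared by the main paper and several of the delayed-bandit papers above is a decision loop in which each reward is generated at play time but revealed only after a random delay, while a sliding window keeps estimates responsive to non-stationarity. The following sketch is a generic illustration of that pattern under assumed ingredients (geometric delays, Gaussian rewards, a windowed UCB index); it is not the algorithm of any specific paper in this list.

```python
import heapq
import math
from collections import deque

import numpy as np

rng = np.random.default_rng(1)
n_arms, horizon, window = 4, 2000, 300
mu = rng.uniform(0.2, 0.8, size=n_arms)    # hypothetical arm means

pending = []                                # heap of (arrival_round, arm, reward)
history = deque()                           # matured (round, arm, reward)
counts = np.zeros(n_arms)
sums = np.zeros(n_arms)

for t in range(1, horizon + 1):
    # Collect feedback whose delay has elapsed.
    while pending and pending[0][0] <= t:
        _, arm, reward = heapq.heappop(pending)
        history.append((t, arm, reward))
        counts[arm] += 1
        sums[arm] += reward
    # Drop observations that fell out of the sliding window (drift adaptation).
    while history and history[0][0] <= t - window:
        _, arm, reward = history.popleft()
        counts[arm] -= 1
        sums[arm] -= reward
    # UCB index on windowed statistics; arms with no feedback yet come first.
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
    bonus = np.sqrt(2.0 * math.log(t + 1) / np.maximum(counts, 1))
    choice = int(np.argmax(means + bonus))
    # The environment generates the reward now but reveals it only later.
    delay = int(rng.geometric(0.05))        # assumed delay distribution
    heapq.heappush(pending, (t + delay, choice, float(rng.normal(mu[choice], 0.1))))
```

Replacing the single-arm choice with a top-k selection over estimated base-arm rewards turns the same skeleton into a combinatorial semi-bandit loop.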